This is the exploratory analysis of loans data set as part of the Udacity Data Analysis Nano Degree programme.
This data set contains 113,937 loans with 81 variables on each loan, including loan amount, borrower rate (or interest rate), current loan status, borrower income, EstimatedReturn, EffectiveYield, and many others.
What factors affect a loan’s outcome status?
What affects the borrower’s APR or interest rate?
What determines profits and returns on investment for lenders?
Are there differences between loans depending on how large the original loan amount was?
# import all packages and set plots to be embedded inline.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
# Print version numbers of packages used in this analysis for reproducability purposes.
# pandas: 1.1.5
# numpy: 1.22.1
# matplotlib: 3.4.2
# seaborn: 0.11.2
# python: 3.8.5
import matplotlib as mplt
import platform
print('\n',
'pandas:', pd.__version__, '\n',
'numpy:', np.__version__ , '\n',
'matplotlib:', mplt.__version__, '\n',
'seaborn:', sns.__version__, '\n',
'python:', platform.python_version())
sns.set_style("whitegrid", {'axes.grid' : False})
sns.set_style("ticks")
sns.set(font_scale=1.2, style="whitegrid")
sns.set_theme() # rests to default
get_ipython().ast_node_interactivity = 'all'
# Read csv file into a pandas dataframe.
dfc = pd.read_csv('./uda_data/prosperLoanData.csv')
# lets view a sample of the data to get a sense of what analysis to carry out.
dfc.head(5)
# lets review all the columns names in the dataframe and their data types.
dfc.info()
# let's review the dataframe shape (number of columns and number of rows).
dfc.shape
# are there NA values present in the dataset?
# Yes, there is. But to view this better we need to isoloate the NaN columns out
nans = dfc.isna().sum()
nans/dfc.shape[0]
# let's find the the number of NaNs for each column where there is NaN
dfc[dfc.columns[dfc.isna().any()]].isna().sum()
# let's calculate the NaNs percentage for columns with NaN is present
dfc[dfc.columns[dfc.isna().any()]].isna().sum() * 100 / dfc.shape[0]
# these are the columns with more than 20% NaNs
(nans/dfc.shape[0]).where(lambda x : x>0.20).dropna()
Inspecting the variable definitions Excel sheet, One can explain the very high NaNs present in some of the columns:
1- these variables are populated for post Jul 2009
EstimatedEffectiveYield 25.526387
EstimatedLoss 25.526387
EstimatedReturn 25.526387
ProsperRating (numeric) 25.526387
ProsperRating (Alpha) 25.526387
ProsperScore 25.526387
2- These variables are populated for pre Jul 2009
CreditGrade 74.588588
3- GroupKey is assigned if member of group otheriwse null
GroupKey 88.290898
4- Closed date is applicable for Cancelled, Completed, Chargedoff and Defaulted loan statuses.
ClosedDate 51.649596
5- These variables values will be null if the borrower had no prior loans.
TotalProsperLoans 80.616481
TotalProsperPaymentsBilled 80.616481
OnTimeProsperPayments 80.616481
ProsperPaymentsLessThanOneMonthLate 80.616481
ProsperPaymentsOneMonthPlusLate 80.616481
ProsperPrincipalBorrowed 80.616481
ProsperPrincipalOutstanding 80.616481
ScorexChangeAtTimeOfListing 83.387311
6- This variable captures the cycle the loan was charged off. If the loan has not charged off the value will be null.
LoanFirstDefaultedCycleNumber 85.121602
The above high NaN's columns indicate that for some varaible analysis we will need to do pre or post slicing of data in comparisons to ensure statistical findings are not skewed incorrenctly by comparing two different datasets.
7- The small percentage NaNs columns indicate the absense of some data about some individuals. Those are find and will be further examined or filtered during those variables analysis.
The dataset is 113937 rows by 81 variables.
There are null values present in the dataset. High NaNs present for some columns were found to be justified and mainly centered around the split of data between pre/post 2009 as explained above. The analysis will have to address data points pre and post 2009 separatly.
Where is small NaN percentages present (less than 7%) we will either omit those during the analysis of these variables or apply any other required treatment to these variables during the exploration phase.
Features of interest can vary according to the purpose of the analysis. For example we can choose any of these as key features interest:
1- LenderYield, and EstimatedReturn: If the purpose is investigating Lenders Yield and understanding factors that impact it.
2- BorrowerAPR, BorrowerRate: If the purpose is investigating borrowers interest rate and what impacts it.
3- MonthlyLoanPayment: If the purpose is investigating the monthly payment amount and which variables impact its values.
4- LoanOriginalAmount: If the purpose of analysis to investigate the amounts of loan seeked/offered and improve service offering for different loan demand.
5- TotalInquiries: If the purpsoe of the analysis to investigate level of inquiries, and potential options to decrease these inquiries in effort to digitise services and reduce cost.
For this analysis, we will take LenderYield, EstimatedReturn, EstimatedLoss, and EstimatedEffectiveYield as our key features of interest for the purpose of this analysis. It would be intereting to understand how those metrics are affected by some of the dimensions in the dataset e.g. credit score/ credit rating, having a house or not, having a job or not, income levels ... etc and what are the main drivers behind Returns levels from those loans.
These are the feature that I think can be used to support my analysis and used for slicing, dicing, and segmenting the data:
LoanStatus, ListingCategory, Term, Recommendations, LenderYield, EstimatedReturn, CreditScoreRangeLower, CreditScoreRangeUpper, EmploymentStatus, ProsperRating, ProsperScore, IsBorrowerHomeowner, CurrentDelinquencies, DelinquenciesLast7Years, TradesOpenedLast6Months, DebtToIncomeRatio, IncomeRange, IncomeVerifiable.
In this section, investigate distributions of individual variables. If you see unusual points or outliers, take a deeper look to clean things up and prepare yourself to look at relationships between variables.
Rubric Tip: The project (Parts I alone) should have at least 15 visualizations distributed over univariate, bivariate, and multivariate plots to explore many relationships in the data set. Use reasoning to justify the flow of the exploration.
Rubric Tip: Use the "Question-Visualization-Observations" framework throughout the exploration. This framework involves asking a question from the data, creating a visualization to find answers, and then recording observations after each visualisation.
# double check df size:
dfc.shape
## Clean ##
## 2 ##
# which are the popular categories loans are taken for?
# lets convert the loan category numbers into their actual names for user friendleness and ease of understanding plots.
cat_dict = {0: 'Not Available', 1: 'Debt Consolidation', 2: 'Home Improvement', 3: 'Business',
4: 'Personal Loan', 5: 'Student Use', 6: 'Auto', 7: 'Other', 8: 'Baby&Adoption', 9: 'Boat',
10: 'Cosmetic Procedure', 11: 'Engagement Ring', 12: 'Green Loans', 13: 'Household Expenses',
14: 'Large Purchases', 15: 'Medical/Dental', 16: 'Motorcycle', 17:'RV', 18:'Taxes', 19:'Vacation', 20: 'Wedding Loans'}
dfc['ListingCategory'] = dfc['ListingCategory (numeric)'].map(cat_dict)
## Clean ##
## 3 ##
# convert date string to date object to work with.
dfc['LoanDate'] = pd.to_datetime(dfc['LoanOriginationDate'].str.split(' ', n=1, expand = True)[0], format='%Y-%m-%d')
dfc['year'] = dfc['LoanDate'].map(lambda x: x.strftime('%Y'))
# are there any NaNs instroduced in the date conversion process?
# No, all good.
dfc['LoanDate'].isnull().sum(), dfc['LoanDate'].notnull().sum()
dfc['year'].value_counts()
## Clean ##
## 4 ##
# convert some variables into ordered categorical types so that they appear in a logical order during analysis and when plotting:
def sort_categorical_vars():
ordinal_var_dict = {'LoanStatus': ['Completed', 'Current', 'Defaulted', 'Chargedoff', 'Cancelled', 'FinalPaymentInProgress',
'Past Due (1-15 days)', 'Past Due (16-30 days)', 'Past Due (31-60 days)',
'Past Due (61-90 days)', 'Past Due (91-120 days)', 'Past Due (>120 days)'],
'CreditGrade': ['AA', 'A', 'B', 'C', 'D', 'E', 'HR', 'NC'],
'ProsperRating (Alpha)': ['AA', 'A', 'B', 'C', 'D', 'E', 'HR'],
'IncomeRange': ['$0', '$1-24,999', '$25,000-49,999', '$50,000-74,999', '$75,000-99,999', '$100,000+', 'Not employed', 'Not displayed'],
'year': ['2006','2007','2008','2009','2010','2011','2012','2013','2014']
#'col2': ['J', 'I', 'H', 'G', 'F', 'E', 'D'],
}
for var in ordinal_var_dict:
pd_ver = pd.__version__.split(".")
if (int(pd_ver[0]) > 0) or (int(pd_ver[1]) >= 21): # v0.21 or later
ordered_var = pd.api.types.CategoricalDtype(ordered = True,
categories = ordinal_var_dict[var])
dfc[var] = dfc[var].astype(ordered_var)
else: # pre-v0.21
dfc[var] = dfc[var].astype('category', ordered = True,
categories = ordinal_var_dict[var])
sort_categorical_vars()
## Clean ##
## 5 ##
# NaNs analysis revealed that we need to treat pre and post Jul 2009 data observations differently.
# let's define a day to cut the dataframe with
from datetime import datetime
# datetime(year, month, day)
d2009Jan = datetime(2009, 1, 1 )
d2009Aug = datetime(2009, 8, 1 )
print(d2009Jan, '<->', d2009Aug)
# add extra categorical column for differentiating the two datasets in the amin dfc
#dfc['period'] = np.select ([dfc['LoanDate'] >= d2009Aug, dfc['LoanDate'] < d2009Jan], ['post-Jul2009', 'pre-Jul2009'])
dfc['period'] = np.where(dfc['LoanDate'] >= d2009Aug, 'post-Jul2009', 'pre-Jul2009')
# lets slice the dataframe into two datasets, one for the period upto Dec 2008, and another for the period starting Aug 2009
# This will help us keep the analysis meaningful and metrics consistent as we go along.
dfc_new = dfc.loc[dfc['LoanDate'] >= d2009Aug,:].copy()
dfc_old = dfc.loc[dfc['LoanDate'] < d2009Jan,:].copy()
# validate split
dfc_new.shape, dfc_old.shape
# check count excl. NaNs
# same as dfc_old.shape, which looks good.
dfc_old['Term'].value_counts()
# no data between 1/1/2009 and 1/8/2009 at least.
dfc[(dfc['LoanDate'] >= d2009Aug) & (dfc['LoanDate'] < d2009Jan)]
## Clean ##
## 6 ##
# split new data based on 2009-2010 group and 2011-2014
dfc_new['yearSplit'] = np.where((dfc_new['year'] == '2009') | (dfc_new['year'] == '2010'), '2009-2010', '2011-2014')
# validate
dfc_new.groupby(['yearSplit'])['year'].value_counts()
import scipy.signal.signaltools
# hack for statsmodels 0.12.x to allow patchworklib to work.
def _centered(arr, newsize):
# Return the center newsize portion of the array.
newsize = np.asarray(newsize)
currsize = np.array(arr.shape)
startind = (currsize - newsize) // 2
endind = startind + newsize
myslice = [slice(startind[k], endind[k]) for k in range(len(endind))]
return arr[tuple(myslice)]
scipy.signal.signaltools._centered = _centered
#################
#################
# Does the data set has many members that take more than one loan?
# Vast majority of customers in the data set have one loan, and roughly 2/7th of customers will have more than one loan.
mdist = dfc.groupby(['MemberKey'])['MemberKey'].count().sort_values(ascending=False).reset_index(name='counts')
ax = sns.histplot(data=mdist, x='counts', color='deepskyblue')
ax.set_xticks(sorted(mdist['counts'].unique())) # set xtickets first!! before setting xlabel below
ax.set_xticklabels(sorted(mdist['counts'].unique()))
ax.set(xlabel="count of loan", ylabel="Count", title='Histogram for number of loads for each member');
# lets get a view of the rows with any NaNs and determine if pre/post Jul 2009 is a valid assumption as found in the data cleaning stage
# This shows where some variables have NaNs, other variables hold valuable data that can be useful for analysis in the same row! Therefore removing rows with
# NaN could introduce bias in our data.
dfc[dfc.isna().any(axis=1)]#.sum()
# We can filter out the NaNs rows further by looking at specific LoanStatuses.
# But this did not seem necessarly and does not give us any more info about the data set. a dead-end!
dfc[dfc.isna().any(axis=1)].query("LoanStatus == 'Completed'")#.sum()
get_ipython().ast_node_interactivity = 'all' #'last_expr' to revert to the default behaviour. Other possible options are 'none', 'last' and 'last_expr_or_assign'.
# let's look at the Estimated Return = NaN, which has a 25% NaNs in the data set, broken down by LoanStatus=Completed.
f"{dfc[dfc['EstimatedReturn'].isnull()]['LoanStatus'].count():.2f}", f"{dfc[dfc['EstimatedReturn'].isnull()]['LoanStatus'].count()/dfc.shape[0]:.2%}" # 0.2f for decimal, or 6.2f for decimal with leading spaces.
dfc['LoanStatus'].where(lambda x: x == 'Completed').count(), '<= Compeleted status Total Count'
dfc[dfc['EstimatedReturn'].isnull()]['LoanStatus'].where(lambda x: x == 'Completed').count(), '<= EstimatedReturn = NaN count' #isna()
dfc[dfc['EstimatedReturn'].notnull()]['LoanStatus'].where(lambda x: x == 'Completed').count(), '<= EstimatedReturn != NaN count' #notna()
# So, about 50% of LoanStatus=Completed has Estimated Return set to NaN.
# This again enforces that removing rows with NaNs from the data set is not wise.
# 'so we will deal with NaNs on a per analysis or plot basis.
get_ipython().ast_node_interactivity = 'all'
# let's look at the Estimated Return = NaN, which has a 25% NaNs in the data set, brokendown by LoanStatus=Current this time!
dfc['LoanStatus'].where(lambda x: x == 'Current').count(), '<= Current status Total Count'
dfc[dfc['EstimatedReturn'].isnull()]['LoanStatus'].where(lambda x: x == 'Current').count(), '<= EstimatedReturn = NaN count' #isna()
dfc[dfc['EstimatedReturn'].notnull()]['LoanStatus'].where(lambda x: x == 'Current').count(), '<= EstimatedReturn != NaN count' #notna()
# a full breakdown of EstimatedReturn = NaNs by LoanStatus!
# We can see that only 4 LoanStatuses have EstimatedReturn with to NaNs.
# Those are statuses that occured in the past! which makes sense given that EstimatedReturn is not filled for pre Jul 2009 data.
# So no unexpected NaNs in the post Jul 2009 is present.
dfc[dfc['EstimatedReturn'].isna()].groupby(['LoanStatus'])['LoanStatus'].count()
# a full breakdown of EstimatedReturn != NaNs by LoanStatus! for analysis completion!
# we can see that EstimatedReturn is distributed acorss all entries for post Jul 2009 data.
dfc[dfc['EstimatedReturn'].notna()].groupby(['LoanStatus'])['LoanStatus'].count()
dfc.LoanStatus.unique()
# double check that CreditGrade categorical variable is still an ordered list
dfc_old.CreditGrade
# we get categories listed at the bottom in provided order, like do
# ['AA' < 'A' < 'B' < 'C' < 'D' < 'E' < 'HR' < 'NC']
# let's plot the above numbers for visual inspection and ease of comparison!
plt.figure(figsize=(20,15))
plt.subplots_adjust(top = 0.8, bottom=0.01, hspace=1.1, wspace=0.2) #control gap between subplots
ax1=plt.subplot(2,2,1)
grid = sns.countplot(data=dfc, x='LoanStatus', color='deepskyblue')
#plt.xticks(rotation=90); # OR beter
grid.tick_params(axis='x', labelrotation=90)
grid.set_title('Count plot of Loan Statuses')
grid.set(xlabel='Loan Statuses', ylabel='Count');
ax2=plt.subplot(2,2,3)
grid = sns.countplot(data=dfc[dfc['EstimatedReturn'].notna()], x='LoanStatus', color='deepskyblue')
#plt.xticks(rotation=90); # OR beter
grid.tick_params(axis='x', labelrotation=90)
grid.set_title('Count plot of Loan Statuses for EstimatedReturn not NaN')
grid.set(xlabel='Loan Statuses', ylabel='Count (EstimatedReturn not NaN)');
ax2.sharey(ax1)
ax3=plt.subplot(2,2,4)
grid = sns.countplot(data=dfc[dfc['EstimatedReturn'].isna()], x='LoanStatus', color='deepskyblue')
#plt.xticks(rotation=90); # OR beter
grid.tick_params(axis='x', labelrotation=90)
grid.set_title('Count plot of Loan Statuses for EstimatedReturn NaN')
grid.set(xlabel='Loan Statuses', ylabel='Count (EstimatedReturn is NaN)');
ax3.sharey(ax1);
The above analysis did not lead to many findings. However confirmed that NaNs are prsent predomenantly due to pre/post Jul 2009 data split.
Therefore, we still keep our initial conclusion of splitting the data to pre/post 2009 different sets for consistancy and analysis robustness!
The pre/post Jul 2009 data sets seem to suggest that we are looking at two different data sets with possibly different characteristics.
Therefore, it would be useful to have a look at the time trends for this data and see if there is any more insights can be gleaned from that!
# let's get a summary view of the LoanDate values.
dfc['LoanDate'].describe()
get_ipython().ast_node_interactivity = 'all'
years = (dfc['LoanDate'].max().year - dfc['LoanDate'].min().year)
years1 = dfc['LoanDate'].min().year
years2 = dfc['LoanDate'].max().year
months1 = dfc['LoanDate'].min().month
months2 = dfc['LoanDate'].max().month
print('Start year, End year')
years1, years2
print('Start Month, End months')
months1, months2
print('no. years, no. months')
years, years * 12
#create bins equal to number of months in the range
bins = (years - 2)*12 + months1 + months2 #(total number of years - first and last year)*12months + months from 1st year, and last year.
bins
# let's look at the day distribution of loans.
# we can clearly see a gap in the middle of the data around end of 2008 and start 2009.
# bining data column is not a good idea, as it does not obey the month bounries. have to manually create x-axis monthly ticks for plotting in next plot.
plt.figure(figsize=(15,5))
# bin date column
dfc['LoanDate'].hist(bins=bins, color='deepskyblue') # bins is the number of months in the period between the minimum and maximum date years.
# I think bins did not work properly because the split is equals intervals, which does not necessarly respect months start and end.
# better to plot date x-axcess bar plot than using df.hist
# let's look at the date distribution for the loans data set.
# we can see a gap in the middle splitting the data into two periods,
# Some metrics are only present in one of the periods but not the other. Looks like some metrics are removed and some new ones
# are introduced in the second period starting Aug2008.
plt.figure(figsize=(15,5))
# get year_month values for ease of plotting timeline in ordered consistent manner:
dfc['y_m'] = dfc['LoanDate'].map(lambda x: x.strftime('%Y-%m'))
grid = sns.countplot(data=dfc.sort_values(by=['y_m']), x='y_m', color='deepskyblue')
grid.tick_params(axis='x', labelrotation=90, labelsize=10)
grid.set_title('Loan Date Distribution', fontsize=14)
grid.set_xlabel("Loan dates", fontsize=12)
grid.set_ylabel("Count of loans", fontsize=12)
plt.show();
# What is the LenderYield and EstimatedReturn histograms look like? so to understand the distribution of values.
plt.figure(figsize = [20, 15])
plt.subplots_adjust(top = 0.5, bottom=0.01, hspace=.5, wspace=0.2)
# calcualte bins
start = int((dfc_new['EstimatedReturn'].min()-0.01) * 100)/100
end = int((dfc_new['EstimatedEffectiveYield'].max()+0.01)*100)/100
bins = np.arange(start, end , 0.01, dtype=float)
ax2 = plt.subplot(2, 2, 2)
ax2 = dfc_new['EstimatedEffectiveYield'].hist(bins = bins , color='deepskyblue');
ax2.set(xlabel="EstimatedEffectiveYield", ylabel="Count", title='EstimatedEffectiveYield Historgram')
ax1 = plt.subplot(2, 2, 1, sharex=ax2, sharey=ax2)
ax=dfc_new['LenderYield'].hist(bins = bins , color='deepskyblue')
ax.set(xlabel="LenderYield", ylabel="Count", title='LenderYield Histogram')
ax3 = plt.subplot(2, 2, 3, sharex=ax2, sharey=ax2)
ax = dfc_new['EstimatedReturn'].hist(bins = bins , color='deepskyblue');
ax.set(xlabel="EstimatedReturn", ylabel="Count", title='EstimatedReturn Historgram')
ax4 = plt.subplot(2, 2, 4, sharex=ax2, sharey=ax2)
ax=dfc_new['EstimatedLoss'].hist(bins = bins , color='deepskyblue')
ax.set(xlabel="EstimatedLoss", ylabel="Count", title='EstimatedLoss Histogram')
xleft, xright = ax4.get_xlim()
yleft, yright = ax4.get_ylim()
plt.show();
# validate values
get_ipython().ast_node_interactivity = 'all'
xleft, xright
yleft, yright
# (0.014500000000000002, 0.35550000000000004)
# (0.0, 7343.7)
# If used LenderYield for generating bins, it trims EstimatedReturn x-axis for values below 0.
# These values under 0 are very few and not noticable in the histogram plot as show in the below numbers and plot
get_ipython().ast_node_interactivity = 'all'
dfc_new[dfc_new['EstimatedReturn'] < 0 ].shape
dfc_new[dfc_new['EstimatedReturn'] < 0 ]['EstimatedReturn'].min()
dfc_new[dfc_new['EstimatedReturn'] < 0 ]['EstimatedReturn'].value_counts()
plt.figure(figsize = [20, 10])
ax2 = plt.subplot(2, 2, 2)
ax2 = dfc_new['EstimatedEffectiveYield'].hist( color='deepskyblue');
ax2.set(xlabel="EstimatedEffectiveYield", ylabel="Count", title='EstimatedEffectiveYield Historgram')
# compare np bin generated array vs manually generated array of bins
np.set_printoptions(suppress=False)
bins_old = np.histogram_bin_edges(dfc_old['LenderYield'], bins='fd')
bins_old
bins_calc = np.arange(dfc_old['LenderYield'].min(), dfc_old['LenderYield'].max()+0.01 , 0.01)
bins_calc
# LenderYield is the only variable present in pre Jul 2009
# So, what is the distribution of LenderYield values like for pre 2009 and post 2009 periods?
# The distributions of pre Jul 2009 and post Jul 2009 look very similar with one noticable distinction, post Jul 2009 data has a
# large peak around LenderYield 3.1 vs pre Jul 2009, which could be due to some change in post Jul 2009 data, e.g.:
# introducing new product / loan category with fixed rate at around 3.1
# or making a change to how LenderYield is calculated.
plt.figure(figsize = [20, 15])
plt.subplots_adjust(top = 0.5, bottom=0.01, hspace=.5, wspace=0.2) # control gap between subplots
bins_new = np.histogram_bin_edges(dfc_new['LenderYield'], bins='fd') # bins=auto or bins='fd' to force the Freedman Diaconis Estimator
bins_old = np.histogram_bin_edges(dfc_old['LenderYield'], bins='fd')
ax1=plt.subplot(221)
ax1.set_xlim(xleft, xright) # set to previous plot for ease of comparison
ax1.set_ylim(yleft, yright)
ax=dfc_new['LenderYield'].hist(bins = bins_new, color='deepskyblue')
ax.set(xlabel="LenderYield", ylabel="Count", title='LenderYield Histogram - post Jul 2009')
# just a placeholder to align plots
ax2=plt.subplot(222, sharex=ax1, sharey=ax1)
ax3=plt.subplot(223, sharex=ax1, sharey=ax1)
ax=dfc_old['LenderYield'].hist(bins = bins_old, color='deepskyblue')
ax.set(xlabel="LenderYield", ylabel="Count", title='LenderYield Histogram - pre Jul 2009')
# just a placeholder to align plots
ax4=plt.subplot(224, sharex=ax1, sharey=ax1)
plt.show();
# Does BorrowerRate and BorrowerRateAPR exhibit the same observations for LenderYield when comparing pre 2009 and post 2009 periods?
# Those plots below reiterate and confirms the the previous observations in LenderYield in the above plots, where in the
# post Jul 2009 there is a large peak of BorrowerRate around (0.3-0.32) and BorrowerRateAPR (around 0.35-0.36).
# using ax2.sharey(ax1)
# using own clac of bins
plt.figure(figsize = [20, 10])
plt.subplots_adjust(top = 0.5, bottom=0.01, hspace=.5, wspace=0.2) #control gap between subplots
bins_calc = np.arange(dfc['BorrowerRate'].min(), dfc['BorrowerRate'].max()+0.01 , 0.01)
plt.subplot(221)
ax1=dfc_new['BorrowerRate'].hist(bins = bins_calc, color='deepskyblue')
ax1.set(xlabel="BorrowerRate", ylabel="Count", title='BorrowerRate Histogram - post Jul 2009')
plt.subplot(222)
ax2=dfc_new['BorrowerAPR'].hist(bins = bins_calc, color='deepskyblue')
ax2.set(xlabel="BorrowerAPR", ylabel="Count", title='New BorrowerAPR Histogram - post Jul 2009')
ax2.sharex(ax1)
ax2.sharey(ax1);
plt.subplot(223)
ax3=dfc_old['BorrowerRate'].hist(bins = bins_calc, color='deepskyblue')
ax3.set(xlabel="BorrowerRate", ylabel="Count", title='Old BorrowerRate Histogram - pre Jul 2009')
ax3.sharex(ax1)
ax3.sharey(ax1);
plt.subplot(224)
ax4=dfc_old['BorrowerAPR'].hist(bins = bins_calc, color='deepskyblue')
ax4.set(xlabel="BorrowerAPR", ylabel="Count", title='Old BorrowerAPR Histogram - pre Jul 2009')
ax4.sharex(ax1)
ax4.sharey(ax1);
The analysis above revealed that:
# Let's examine the distribution of loan Categories, where I suspect there will be important variations that influences trends in data
# Debt Consolidation is by a large margin is the most dominant loan category.
# Not Available and Other are the second most common classifications, which surprising. Also, what is the difference between them?
# Home Improvement and Business are also a bit more popular than other cateogires
grid = sns.countplot(x=df['ListingCategory'], color='deepskyblue')
grid.tick_params(axis='x', labelrotation=90)
grid.set_title('Count of Loans Category')
grid.set(xlabel='Listing Category', ylabel='count');
# Lets compare pre and post Jul 2009 for loan categories
# there are more categories introduced post Jul 2009 compared to pre Jul 2009 (19 vs 6 categories) .
# Not available is the most dominant classication of loans pre Jul 2009. Does this indicate a technical fault in data export? or data capture? in Pre Jul 2009 systems
# or was it limitation when logging cateogries. Does this classification capture categories that are now better classified in post Jul 2009?
# Post Jul 2009 data set show the introduction of Debt Consolidation, which may be a totally new loan product, or was classified as Not Available in pre Jul 2009 data set
# Post Jul 2009 still contains Other category, which is also present in pre Jul 2009.
plt.figure(figsize = [20, 5])
plt.subplot(1, 2, 1)
#dfc_new['ListingCategory'].value_counts()
grid = sns.countplot(x=dfc_new['ListingCategory'], color='deepskyblue')
grid.tick_params(axis='x', labelrotation=90)
grid.set_title('Count of Loans Category - post Jul 2009')
grid.set(xlabel='Listing Category', ylabel='count')
plt.subplot(1, 2, 2)
grid = sns.countplot(x=dfc_old['ListingCategory'], color='deepskyblue')
grid.tick_params(axis='x', labelrotation=90)
grid.set_title('Count of Loans Category - pre Jul 2009')
grid.set(xlabel='Listing Category', ylabel='count');
# How much do people borrow?
dfc['LoanOriginalAmount'].describe()
# It seems the minimum amount of borrowing is set at 1000, and the maximum is set at 35000
# how does the histogram of Loan amounts look like?
# The histogram shows a number fo very large peaks for some loan value bins. What could be causing this?
# Use white grid plot background from seaborn
plt.figure(figsize=(15,8))
bins_x = np.arange(1000, 35000+500, 500)
#print(bins_x)
grid = sns.histplot(data=dfc, x='LoanOriginalAmount', bins=bins_x, color='deepskyblue' ) #binwidth=500
grid.set_xticks(bins_x) # <--- set the ticks first
grid.set_xticklabels(bins_x,rotation=90);
grid.set_title('Histogram Plot of LoanOriginalAmount')
grid.set(xlabel='LoanOriginalAmount Bins', ylabel='Count');
# Let's check the actual count of loan values to determine if loans are a spectrum spread of values or are fixed at specific values
# due to the wide range of loan values, let's zoom in at loans below 5000
# There are large concentrations (counts) of loans at specific values such as 2000, 2500, 3000, 3500, 4000, 4500, 5000. That are also many loans distributed between
# those peak values highlighted by the very tinny blue picks at the x-axis.
import matplotlib.ticker as ticker
sns.reset_orig()
fig = plt.figure(figsize=(20,10))
grid = sns.countplot(x=dfc_new['LoanOriginalAmount'].where(lambda x : x<=20001), color='deepskyblue')
grid.tick_params(axis='x', labelrotation=90)
grid.set_title('Count of LoanOriginalAmount different values <= 5000')
grid.set(xlabel='LoanOriginalAmount', ylabel='count');
# annotate/highlight high volume counts bars
def annotate(ax,threshold):
for i,t in enumerate(ax.get_xticklabels()):
if ax.patches[i].get_height() > threshold:
## bold ticklabels
t.set_weight("bold")
## bar edges
ax.patches[i].set_edgecolor("k")
ax.patches[i].set_linewidth(1.5)
## arrow annotations
ax.annotate(t.get_text(),(i, ax.patches[i].get_height()),
xytext=(0,30), textcoords='offset points', ha="center",
arrowprops=dict(facecolor='black', shrink=0.05))
annotate(grid,300)
# change rectangle width of bars (vertical lines) in the plot to make them more pronounced
def change_width(ax, frac):
for patch in ax.patches:
current_width = patch.get_width()
new_value = current_width*frac
# we change the bar width
patch.set_width(new_value)
# we recenter the bar
diff = current_width - new_value
patch.set_x(patch.get_x() + diff * .5)
change_width(grid, 2)
# display the x-ticks only at intervals we want to highlight. otherwise x-axis becomes cluttered
def format_xlabels(grid, interval):
labelslist = grid.get_xticks().tolist()
for i,label in enumerate(grid.get_xticklabels()):
labelslist[i]=str(int(float(label.get_text())))
label.set_text(str(int(float(label.get_text()))))
if np.float64(label.get_text()) % interval == 0:
label.set_visible(True)
else:
label.set_visible(False)
grid.set_xticklabels(labelslist)
format_xlabels(grid, 500)
fig.canvas.draw()
plt.show();
# Let's zoom in at the very low values to make them visible and verify that aside from the very large peaks,
# there is a wide spread spectrum of loan values between those noticable peaks in the previous plot
import matplotlib.ticker as ticker
sns.reset_orig()
fig = plt.figure(figsize=(15,10))
grid = sns.countplot(x=dfc_new['LoanOriginalAmount'].where(lambda x : x<=5001), color='deepskyblue')
grid.set_title('Count of LoanOriginalAmount different values <= 5000')
grid.set(xlabel='LoanOriginalAmount', ylabel='count');
# annotate/highlight high volume counts
def annotate(ax,threshold):
for i,t in enumerate(ax.get_xticklabels()):
if ax.patches[i].get_height() > threshold:
## bold ticklabels
t.set_weight("bold")
## bar edges
ax.patches[i].set_edgecolor("k")
ax.patches[i].set_linewidth(1.5)
## arrow annotations
ax.annotate(t.get_text(),(i, ax.patches[i].get_height()),
xytext=(0,30), textcoords='offset points', ha="center",
arrowprops=dict(facecolor='black', shrink=0.05))
#annotate(grid,300)
# change rectangle width of bars (vertical lines) in the plot to make them more pronounced
def change_width(ax, frac):
for patch in ax.patches:
current_width = patch.get_width()
new_value = current_width*frac
# we change the bar width
patch.set_width(new_value)
# we recenter the bar
diff = current_width - new_value
patch.set_x(patch.get_x() + diff * .5)
change_width(grid, 2)
# display the x-ticks only at intervals we want to highlight. otherwise x-axis becomes cluttered
def format_xlabels(grid, interval):
labelslist = grid.get_xticks().tolist()
for i,label in enumerate(grid.get_xticklabels()):
labelslist[i]=str(int(float(label.get_text())))
label.set_text(str(int(float(label.get_text()))))
if np.float64(label.get_text()) % interval == 0:
label.set_visible(True)
else:
label.set_visible(False)
grid.set_xticklabels(labelslist)
format_xlabels(grid, 500)
# we can remove certain columns from the plot if deemed necessary.
# def remove_elm (grid, threshold):
# for patch in grid.patches:
# if patch.get_height() > 300:
# patch.remove()
# fig.canvas.draw()
# remove_elm(grid,300)
# zoom in on the y-axis to see the inbetween loans distribution
grid.set_ylim([0,100])
fig.canvas.draw();
plt.show();
# check top Loan Values
dfc_new.groupby('LoanOriginalAmount')['LoanOriginalAmount'].count().sort_values(ascending=False)[0:20]
# check bottom Loan Values
dfc_new.groupby('LoanOriginalAmount')['LoanOriginalAmount'].count().sort_values(ascending=True)[0:20]
There are no loan amounts less than 1K.
Loan amounts 1K-5K dollars are the most popular loan amounts, followed by 5K-10K loan amounts, then followed by 10K-15K
There are global 'very big' spikes for loan amounts at exact whole numbers e.g.: (4,000), (10,000), (15,000) ..etc.
There are local 'big' spikes for loan amounts at exact whole numbers e.g. (1,000),(2,000), (3,000), (5,000), (6,000) ..etc.
Then there are local medium spikes for loans at one half of numbers e.g. (1,500), (2,500), (3,500) ..etc.
The histogram plot is skewed to the right. Therefore, given it is monetary values, a log transformation on the x axis might be suitable for explaining the loan amount trends.
# lets check what a log of LoanOriginalAmount might look like:
np.log10(dfc['LoanOriginalAmount'].describe())
# let's create log bins for histogram
bins_xlog = 10 ** np.arange(3, 4.55 + 0.1, 0.1)
bins_xlog
# let's plot the histogram of log transformation on LoanOriginalAmount:
# The histogram now exhibits a shape similar to normal distribution bell. Which indicates the LoanOriginalAmount trends to change by multiplicative factors.
plt.figure(figsize=(10,5))
tick_locs = [1000, 2000, 5000, 10000, 20000, 50000]
grid = sns.histplot(data=dfc, x='LoanOriginalAmount', bins=bins_xlog) #binwidth=500
grid.set(xscale="log") #, yscale="log"
grid.set_xticks(tick_locs) # <--- set the ticks first
grid.set_xticklabels(tick_locs, rotation=90);
grid.set_title('Histogram Plot of Log LoanOriginalAmount')
grid.set(xlabel='Log LoanOriginalAmount Bins', ylabel='Count');
# So, how does the LoanOriginalAmount differ between data for post Jul 2009 and data for pre Jul 2009?
# Pre Jul 2009 data shows loan amounts in the smaller values to be the most dominant, however
# Post Jul 2009 data shows a reduction of very small amounts less than 2000, and overall increase in the other loan amount values, but more noticably
# a very large increase in (4000-5000, 10000-11000, 15000-16000, 20000-21000, 25000-26000)!
plt.figure(figsize = [20, 5])
plt.subplot(1, 2, 1)
bins_x = np.arange(1000, 35000+1000, 1000)
grid1 = sns.histplot(data=dfc_new, x='LoanOriginalAmount', bins=bins_x) #binwidth=500
grid1.set_xticks(bins_x) # <--- set the ticks first
grid1.set_xticklabels(bins_x, rotation=90, fontsize=12);
grid1.set_title('Histogram Plot of LoanOriginalAmount - Post Jul 2009')
grid1.set(xlabel='LoanOriginalAmount Bins', ylabel='Count');
plt.subplot(1, 2, 2)
bins_x = np.arange(1000, 35000+1000, 1000)
grid2 = sns.histplot(data=dfc_old, x='LoanOriginalAmount', bins=bins_x) #binwidth=500
grid2.sharey(grid1) # adjust y scales for ease of comparison
grid2.set_xticks(bins_x) # <--- set the ticks first
grid2.set_xticklabels(bins_x, rotation=90, fontsize=12);
grid2.set_title('Histogram Plot of LoanOriginalAmount - Pre Jul 2009')
grid2.set(xlabel='LoanOriginalAmount Bins', ylabel='Count');
# product the bins for the log transformation
bins_xlog = 10 ** np.arange(3, 4.55 + 0.1, 0.1)
bins_xlog
# Ok, so let's apply log transformation to the LoanOriginalAmount for post Jul 2009 and data for pre Jul 2009 datasets?
# The plots have resplance of a normal distribution bill curve. However the very large peaks are distorting the bill shape.
# excluding the very lage peaks for very specific loan amounts, once can potentially observe a multplicative increase of value in the
# post Jul 2009 dataset. While the pre Jul 2009 data set could be described having bimodal shape.
# Post Jul 2009 data shows the shift towards higher loan amounts vs the pre Jul 2009
plt.figure(figsize=(15,5))
plt.subplot(1, 2, 1)
tick_locs = [1000, 2000, 5000, 10000, 20000, 50000] #cycles of 1-3-10 or 1-2-5-10 are very useful for base-10
grid = sns.histplot(data=dfc_new, x='LoanOriginalAmount', bins=bins_xlog) #binwidth=500
grid.set(xscale="log") #, yscale="log"
grid.set_xticks(tick_locs) # <--- set the ticks first
grid.set_xticklabels(tick_locs,rotation=90);
grid.set_title('Histogram Plot of Log LoanOriginalAmount - post Jul 2009')
grid.set(xlabel='Log LoanOriginalAmount Bins', ylabel='Count');
plt.subplot(1, 2, 2)
tick_locs = [1000, 2000, 5000, 10000, 20000, 50000] #cycles of 1-3-10 or 1-2-5-10 are very useful for base-10
grid = sns.histplot(data=dfc_old, x='LoanOriginalAmount', bins=bins_xlog) #binwidth=500
grid.set(xscale="log") #, yscale="log"
grid.set_xticks(tick_locs) # <--- set the ticks first
grid.set_xticklabels(tick_locs,rotation=90);
grid.set_title('Histogram Plot of Log LoanOriginalAmount - pre Jul 2009')
grid.set(xlabel='Log LoanOriginalAmount Bins', ylabel='Count');
The previous analysis showed a shift towards higher loan amounts post Jul 2009 vs pre Jul 2009.
pre Jul 2009 data shows a reduction in the very small loan amounts (less than 2000)
Post Jul 2009 data shows a very large increase of (4000, 5000, 10000, 15000, 20000, and 25000) loan amounts
The LoanOriginalAmount histogra distribution shows right skewed histogram. Ignoring the very large peaks identifed in the prievous point, the histograms take a bimodal shape specially between these values (1K-5K, 5K-10K, 10K-15K, 15K-20K, and 20K-25K)
LoanOriginalAmoutn seems to increase in multiplicative fashion.
# how does the CreditScore look like (only available for post Jul 2009) and what is the relationship between the Lower and Upper credit scores?
# we see that the lower and upper credit scores represent the coordinates of 15 points (credit scores) (dots on the scale),
# and they are not a scatter of values as originally thought.
grid = sns.scatterplot(data=dfc_new[dfc_new['CreditScoreRangeLower'].notna()], x='CreditScoreRangeLower', y='CreditScoreRangeUpper', color='deepskyblue')
grid.tick_params(axis='x', labelrotation=90)
grid.set_title('Scatterplot of CreditScore Lower vs Upper ranges - Post Jul 2009')
grid.set(xlabel='Lower Credit Range', ylabel='Upper Credit Range');
# let's plot the CreditScore, CreditGrade, ProsperScore, and ProsperRating variables to understand their distribution
# pre Jul 2009 data shows that CreditGrade was used solely to determine credit worthness of borrowers. However post Jul 2009 data shows
# that other credit rating metrics were used to determine credit worthness of borrowers and and risk associated with Loans namely;
# Prosper Rating, Prosper Score, and CreditScore.
plt.figure(figsize=(20,15))
plt.subplots_adjust(top = 0.8, bottom=0.01, hspace=0.2, wspace=0.2) #control gap between subplots
ax1=plt.subplot(2,3,1)
grid = sns.countplot(data=dfc_old[dfc_old['CreditGrade'].notna()], x='CreditGrade', color='deepskyblue')
grid.tick_params(axis='x') #, labelrotation=90
grid.set_title('Countplot of CreditGrade categories - Pre Jul 2009')
grid.set(xlabel='CreditGrade Categories', ylabel='Count');
ax2=plt.subplot(2,3,2)
grid = sns.countplot(data=dfc_new[dfc_new['ProsperRating (Alpha)'].notna()], x='ProsperRating (Alpha)', color='deepskyblue')
grid.tick_params(axis='x') #, labelrotation=90
grid.set_title('Countplot of ProsperRating categories - Post Jul 2009')
grid.set(xlabel='ProsperRating', ylabel='Count');
ax4=plt.subplot(2,3,3)
grid = sns.countplot(data=dfc_new[dfc_new['ProsperScore'].notna()], x='ProsperScore', color='deepskyblue')
grid.tick_params(axis='x') #, labelrotation=90
grid.set_title('Countplot of ProsperScore - Post Jul 2009')
grid.set(xlabel='Prosper Score', ylabel='Count');
ax3=plt.subplot(2,3,4)
grid = sns.countplot(data=dfc_new[dfc_new['CreditScoreRangeLower'].notna()], x='CreditScoreRangeLower', color='deepskyblue')
grid.tick_params(axis='x') #, labelrotation=90
grid.set_title('Countplot of CreditScore Lower range - Post Jul 2009')
grid.set(xlabel='Lower Credit Range', ylabel='Count');
grid.tick_params(axis='x', labelrotation=90)
ax3=plt.subplot(2,3,5)
grid = sns.countplot(data=dfc_new[dfc_new['CreditScoreRangeUpper'].notna()], x='CreditScoreRangeUpper', color='deepskyblue')
grid.tick_params(axis='x') #, labelrotation=90
grid.set_title('Countplot of CreditScore Upper range - Post Jul 2009')
grid.set(xlabel='Upper Credit Range', ylabel='Count');
grid.tick_params(axis='x', labelrotation=90);
pre Jul 2009 data shows that CreditGrade was used solely to determine credit worthness of borrowers.
post Jul 2009 data shows that other credit rating metrics were used to determine credit worthness of borrowers and and risk associated with Loans namely; Prosper Rating, Prosper Score, and CreditScore.
The Prosper Rating, Prosper Score, and CreditScore have a bell shaped normal distribution with slight skewness to the right.
Most loand are given to borrowers that have middle score / rating. Fewer loans are given to borrowers sitting on opposite sides of the scale. This is probably just the normal distribution of borrowers anyway.
There are loans given to borrowers marked as High Risk (HR)!
# Lest look into IncomeRange and see which income range gets most loans?
# The income range from 25K-75K gets the highest number of loans. The range from 75K to 100K+ gets the second highest loan numbers.
# Very few people with 0 income or no employment get a loan.
plt.figure(figsize=(10,5))
plt.subplots_adjust(top = 0.8, bottom=0.01, hspace=0.2, wspace=0.2) #control gap between subplots
grid = sns.countplot(data=dfc[dfc['IncomeRange'].notna()], x='IncomeRange', color='deepskyblue')
grid.tick_params(axis='x') #, labelrotation=90
grid.set_title('Countplot of IncomeRange categories')
grid.set(xlabel='IncomeRange Categories', ylabel='Count');
grid.tick_params(axis='x', labelrotation=90);
# What is the distribution of DebtToIncomeRatio?
# 75% of loans given have an DebtToIncome Ratio less than 0.33. With a mean value of about 0.23.
# Fewer people are have a debt to income ratio higher than 1/3!
plt.figure(figsize=(10,10))
plt.subplots_adjust(top = 0.8, bottom=0.01, hspace=0.7, wspace=0.5)
ax1=plt.subplot(2,2,1)
grid = sns.boxplot(data=dfc, y='DebtToIncomeRatio')
grid.tick_params(axis='x', labelrotation=90)
grid.set_title('Debt to Income Ratio')
grid.set(xlabel='', ylabel='DebtToIncomeRatio');
# zoomed in plot
ax2=plt.subplot(2,2,2)
grid = sns.boxplot(data=dfc, y='DebtToIncomeRatio')
grid.set_ylim([0,1]) #zoom at the y-axis
grid.tick_params(axis='x', labelrotation=90)
grid.set_title('Debt to Income Ratio - y-axis zoom')
grid.set(xlabel='', ylabel='DebtToIncomeRatio');
# How does home ownership distribution looks like in the dataset?
# We have about a 50/50 split in home ownership.
# This variable does not seem to have a strong influence on who gets a loan!
plt.figure(figsize=(8,3))
plt.subplots_adjust(top = 0.8, bottom=0.01, hspace=0.2, wspace=0.2) #control gap between subplots
grid = sns.countplot(data=dfc, x='IsBorrowerHomeowner', color='deepskyblue')
grid.set_title('Countplot of IsBorrowerHomeOwner categories')
grid.set(xlabel='IsBorrowerHomeOwner Categories', ylabel='Count');
grid.tick_params(axis='x', labelrotation=90);
# Lest look into Term and see which Terms are most popular?
# The 36 months Term is the mos popular type of Loan, followed by the 60 months Term Loan, then a finally a small presence of 12 months Term Loan.
plt.figure(figsize=(8,3))
plt.subplots_adjust(top = 0.8, bottom=0.01, hspace=0.2, wspace=0.2) #control gap between subplots
grid = sns.countplot(data=dfc[dfc['Term'].notna()], x='Term', color='deepskyblue')
grid.set_title('Countplot of Term categories')
grid.set(xlabel='Term', ylabel='Count');
grid.tick_params(axis='x', labelrotation=90);
Timeline Variations
Loan Categories
Lender Yield
Estimated Return
Loan Amount
Prosper Rating and Credit Score
Income Range and Debt to Income Ratio
Home Ownership
Term
The following were unusual distribtions:
1- The data is split into two different sets pre Jul 2009 and post Jul 2009. The two data sets have key differences, for example post Jul 2009 introduces new variables and drop others when compared to pre Jul 2009. Therefore, for many exploratory analysis and when applying statistical measures, it was necessarly to treat those two sets differently using a split dataframe dfc_new and dfc_old.
2- Loan amounts have very large peaks in between overshadowing the rest of the loan amounts distributions. Bining the data in histograms helped smoothing the data a bit. Also applying log transformation to loan amounts helped in detecting the multiplicative nature of loan amounts spread.
3- LenderYield Histogram plot exhibited a bimodal shape, however 3.0-3.2 showed the presence of an abnormal very large peak of loans at this value. This needs further investigation to understand what may be causing it!
4- ListingCategory was converted from unmerical to category names, to display meaningful values on the x-axis.
5- LoanStatus, CreditGrade, ProsperRating (Alpha), IncomeRange, and year were converted to ordered categorical variables so that plots show ordered categories for comparison purposes.
6- Converted LoanOriginalDate to a date type column, and also extracted year-month and year columns to use for plotting monthly and yearly variations. Year will also be used for segmenting data in upcoming analysis.
# Let's see how does LenderYield for pre Jul 2009 data vary according to Listing Category?
# Most distributions seems centered around the 0.15, and there is an unusual large spike around 0.34 for all Listing Categories.
# So other than volume difference, Listing Category in pre Jul 2009 data does not seem to impact LenderYield or cause changes
plt.figure(figsize = [20, 10])
sns.set(font_scale=1)
grid = sns.FacetGrid(data=dfc_old, col='ListingCategory', col_wrap=4, margin_titles=True, height=3, aspect= 1.33) # control facitgrid graph widths: margin_titles=True, height=3, aspect= 1.33
grid.map(sns.histplot, 'LenderYield', binwidth=0.01); # remove binwidth if you want auto generated different bin widths for each plot.
grid.fig.subplots_adjust(top=0.9)
grid.fig.suptitle('Histogram Distribution of LenderYield by Listingcategories - pre Jul 2009!', fontsize=14);
# How does LenderYield for post Jul 2009 vary according to Listing Category?
# Business and Home Improvement the distribution ditributions are flatter, and a bit a random shape in post Jul 2009 data when compared to pre Jul 2009
# data, where the data takes a bell shape.
# Also, the post Jul 2009 data shows the oustanding large peak at 0.3, whilest pre Jul 2009 data appears at about 0.34!
# still listing category does not seem to change LenderYield in any predicatable or noticable trend.
plt.figure(figsize = [20, 10])
sns.set(font_scale=1) # change plots individual title size.
grid = sns.FacetGrid(data=dfc_new, col='ListingCategory', col_wrap=4, margin_titles=True, height=3, aspect= 1.33)# control facitgrid graph widths: margin_titles=True, height=3, aspect= 1.33
grid.map(sns.histplot, 'LenderYield', binwidth=0.01); # remove binwidth if you want auto generated different bin widths for each plot.
grid.fig.subplots_adjust(top=0.95)
grid.fig.suptitle('Histogram Distribution of LenderYield by Listingcategories - post Jul 2009!', fontsize=14);
# let's look at the mean LenderYield distribution for all categories:
# dfc_new.groupby('ListingCategory')['LenderYield'].mean().sort_values() #.index
# Debt Consolidationhas being the most popular loan type, seems to have the third lowest LenderYields in average.
plt.figure(figsize=(10,5))
grid = sns.barplot(data=dfc_new, x='ListingCategory', y='LenderYield', estimator=np.mean, order=dfc_new.groupby('ListingCategory')['LenderYield'].mean().sort_values().index, color='deepskyblue')
grid.tick_params(axis='x', labelrotation=90)
grid.set_title('Mean LenderYield by ListingCategory - post Jul 2009')
grid.set(xlabel='ListingCategory', ylabel='LenderYield');
# lets check the distribution of of LinderYield for loans ListingCategories
# Distributions seem similar, with the median value it is centered around pushing up or down the scale. Some distributions are a bit narrower or wider than others.
# Ignoring "Other", Student User, Green Loans, and Business seem to have some of the widest spread of values.
plt.figure(figsize=(10,10))
plt.subplots_adjust(top = 0.8, bottom=0.01, hspace=0.7, wspace=0.2) #control gap between subplots
sns.set_theme() #deafult theme
ax1=plt.subplot(2,1,1)
grid = sns.boxplot(data=dfc_new, y='LenderYield', x='ListingCategory')
#grid.set_ylim([0,2])
grid.tick_params(axis='x', labelrotation=90)
grid.set_title('LinderYield by ListingCategory Boxplot - post Jul 2009')
grid.set(xlabel='ListingCategory', ylabel='LinderYield');
ax1=plt.subplot(2,1,2)
grid = sns.boxplot(data=dfc_old, y='LenderYield', x='ListingCategory')
#grid.set_ylim([0,2])
grid.tick_params(axis='x', labelrotation=90)
grid.set_title('LinderYield by ListingCategory Boxplot - pre Jul 2009')
grid.set(xlabel='ListingCategory', ylabel='LinderYield');
# grid = sns.stripplot(data=dfc, y='DebtToIncomeRatio', x='IncomeRange')
# How does EstimatedReturn for post Jul 2009 data vary according to Listing Category?
# this information is not available for pre Jul 2009
# We can see for all categories the EstimatedReturn is nearly centered around the 0.1 and is spread between approx. 0.5-1.5 and is less than the LenderYield
# above. The spread of values is much narrower and is less than 0.2 despite LenderYield has spread between approx. 0.5 to 3.5.
# Therefore for all listing categories the return on loans does not seem to be affected by categories, and won't increase in the same magnitude as lenderYield.
plt.figure(figsize = [20, 10])
sns.set(font_scale=1)
grid = sns.FacetGrid(data=dfc_new, col='ListingCategory', col_wrap=4, margin_titles=True, height=3, aspect= 1.33) # control facitgrid graph widths: margin_titles=True, height=3, aspect= 1.33
grid.map(sns.histplot, 'EstimatedReturn', binwidth=0.01); # remove binwidth if you want auto generated different bin widths for each plot.
grid.fig.subplots_adjust(top=0.95)
grid.fig.suptitle('Histogram Distribution of EstimatedReturn by ListingCategories - post Jul 2009!', fontsize=14);
# let's look at the mean LenderYield distribution for all categories:
# dfc_new.groupby('ListingCategory')['LenderYield'].mean().sort_values() #.index
# Debt Consolidationhas being the most popular loan type, seems to have the third lowest LenderYields in average.
# Error bars are likely to be influnced by the size of data availble in each Listing category
plt.figure(figsize=(10,5))
sns.set_theme() #go back to default
grid = sns.barplot(data=dfc_new, x='ListingCategory', y='EstimatedReturn', estimator=np.mean, order=dfc_new.groupby('ListingCategory')['EstimatedReturn'].mean().sort_values().index, color='deepskyblue')
grid.tick_params(axis='x', labelrotation=90)
grid.set_title('Mean EstimatedReturn per ListingCategory - post Jul 2009')
grid.set(xlabel='ListingCategory', ylabel='EstimatedReturn')
plt.subplots_adjust(top = 0.8, bottom=0.01, hspace=0.7, wspace=0.5);
# Let's insepct BorrowerRate vs LenderYield (pre and post Jul 2009) and BorrowerRate vs Estimated Return (only available in post Jul 2009)
# so we can try to understand how borrower rate translates to LenderYield then to EstimatedReturn to Lenders.
# The plot shows that there is a near perfect direct linear relationship between BorrowerRate and LenderYield post Jul 2009, which makes sense.
# For pre Jul 2009 data, it is a bit less perfect, showing some level of closely lined up bandings,
# why does LenderYiled in pre Jul 2009 have multiple layers, does it include calcualtions of other factors such as cost?
plt.figure(figsize = [20, 5])
plt.subplots_adjust(top = 0.8, bottom=0.01, hspace=0.7, wspace=0.2) #control gap between subplots
plt.subplot(1, 2, 1)
grid = sns.scatterplot(data=dfc_old[dfc_old['BorrowerRate'].notna()], x='BorrowerRate', y='LenderYield', color='deepskyblue')
grid.tick_params(axis='x', labelrotation=90)
grid.set_title('Scatter plot of BorrowerRate vs LenderYield - Pre Jul 2009')
grid.set(xlabel='BorrowerRate', ylabel='LenderYield');
plt.subplot(1, 2, 2)
grid = sns.scatterplot(data=dfc_new[dfc_new['BorrowerRate'].notna()], x='BorrowerRate', y='LenderYield', color='deepskyblue')
grid.tick_params(axis='x', labelrotation=90)
grid.set_title('Scatter plot of BorrowerRate vs LenderYield - Post Jul 2009')
grid.set(xlabel='BorrowerRate', ylabel='LenderYield');
# Let's insepct BorrowerRate vs LenderYield, EstimatedLoss, EstimatedEffectiveYield, and EstimatedReturn which are only available in post Jul 2009.
# so we can try to understand how BorrowerRate translates to LenderYield and then to EstimatedReturn.
# The plot shows that there is a near perfect direct linear relationship between BorrowerRate and LenderYield post Jul 2009, which makes sense.
# The BorrowerRate vs EstimatedLoss, EstimatedEffectiveYield, and EstimatedReturn show the presence of two types of swarm of data:
# one swarm of data is split into clearly layered bands that seem to correlated to a fixed EstimatedLoss and don't change with BorrowerRate!
# A second swarm of data exhibits a linear relationship between BorrowerRate and EstimatedLoss.
# Those swarms/bands of data are indicative of other variables/factors impacting the duo relationship. To be investigated further in the Multi variate Explorations.
# One key observation to take away is that the EstimatedLoss plot shows that as EstimatedLoss increases, BorrowerRate increases sharply in direct positive manner!
plt.figure(figsize = [20, 10])
plt.subplots_adjust(top = 0.8, bottom=0.01, hspace=0.7, wspace=0.2) #control gap between subplots
plt.subplot(2, 2, 1)
grid = sns.scatterplot(data=dfc_new[dfc_new['BorrowerRate'].notna()], x='BorrowerRate', y='LenderYield', color='deepskyblue')
grid.tick_params(axis='x', labelrotation=90)
grid.set_title('BorrowerRate vs LenderYield - Post Jul 2009')
grid.set(xlabel='BorrowerRate', ylabel='LenderYield');
plt.subplot(2, 2, 2)
grid = sns.regplot(data=dfc_new[dfc_new['BorrowerRate'].notna()], x='BorrowerRate', y='EstimatedLoss',
line_kws={'color': 'g'}, scatter_kws={'color': 'deepskyblue','edgecolor':'white'}, fit_reg=True, ci=68, logistic=True);
grid.tick_params(axis='x', labelrotation=90)
grid.set_title(' BorrowerRate vs EstimatedLoss - Post Jul 2009')
grid.set(xlabel='BorrowerRate', ylabel='EstimatedLoss');
plt.subplot(2, 2, 3)
grid = sns.regplot(data=dfc_new[dfc_new['BorrowerRate'].notna()], x='BorrowerRate', y='EstimatedEffectiveYield',
line_kws={'color': 'g'}, scatter_kws={'color': 'deepskyblue','edgecolor':'white'}, fit_reg=True, ci=68, logistic=True);
grid.tick_params(axis='x', labelrotation=90)
grid.set_title('BorrowerRate vs EstimatedEffectiveYield - Post Jul 2009')
grid.set(xlabel='BorrowerRate', ylabel='EstimatedEffectiveYield');
plt.subplot(2, 2, 4)
grid = sns.regplot(data=dfc_new[dfc_new['BorrowerRate'].notna()], x='BorrowerRate', y='EstimatedReturn',
line_kws={'color': 'g'}, scatter_kws={'color': 'deepskyblue','edgecolor':'white'}, fit_reg=True, ci=68, logistic=True);
grid.tick_params(axis='x', labelrotation=90)
grid.set_title('BorrowerRate vs EstimatedReturn - Post Jul 2009')
grid.set(xlabel='BorrowerRate', ylabel='EstimatedReturn');
# let's see if splitting LoanOriginalAmount by LoanStatus show any sub relationsships:
# The graph seems to suggest that the there are no Past Due loans. But this may be due to scaling effect of very low volumes. needs another count test in the next cell.
# The most current loan amounts are 15K, 4K, and 10K.
#sns.displot(dfc, x='LoanOriginalAmount', row='LoanStatus')
grid = sns.FacetGrid(data = dfc, col = 'LoanStatus', col_wrap=4, margin_titles=True, height=3, aspect= 1.33)
grid.map(sns.histplot, 'LoanOriginalAmount');
grid.fig.subplots_adjust(top=0.9)
grid.fig.suptitle('Histogram Distribution of LoanOriginalAmount by LoanStatus!', fontsize=14);
# let's check a table sum of the LoanStatuses.
# Past Due columns have loans present, but due the volumes being quite low, graph above did not show them clearly.
dfc.groupby(['LoanStatus'])['ListingKey'].agg([('count',lambda x: np.count_nonzero(x)),('percent',lambda x: 100*np.count_nonzero(x)/len(dfc))]).reset_index()
#dfc.groupby(['LoanStatus'])['ListingKey'].count().apply(lambda x: x/len(dfc))
#np.count_nonzero([np.nan,1,2]) # => 3
#np.count_nonzero([0,1,2]) #=> 2
# how does IncomeRange vs DebtToIncomeRatio look like?
# borrowers in the 1-24,999 earning bracket are more common to take up loans with a higher fraction of their income with many having ratios higher than their income.
# The No Employment group although the smallest group, shows very high DebtToIncomeRatio compared to other groups.
# As the income bracket inceases, the DebtToIncomeRatio decreases. Is this due to borrowers of higher incoming taking smaller loans?
# or is this observation is due to merely the math (dividing by larger value) related to the increase of their income?
# What is the average loan amount per income group?
# These feature won't be investigated further!
plt.figure(figsize=(20,10))
plt.subplots_adjust(top = 0.8, bottom=0.01, hspace=0.7, wspace=0.2) #control gap between subplots
sns.set_style("ticks")
ax1=plt.subplot(2,2,1)
grid = sns.boxplot(data=dfc, y='DebtToIncomeRatio', x='IncomeRange')
grid.tick_params(axis='x', labelrotation=90)
grid.set_title('Debt to Income Ratio by Income Range')
grid.set(xlabel='IncomeRange', ylabel='DebtToIncomeRatio');
ax1=plt.subplot(2,2,2)
grid = sns.stripplot(data=dfc, y='DebtToIncomeRatio', x='IncomeRange')
grid.tick_params(axis='x', labelrotation=90)
grid.set_title('Debt to Income Ratio by Income Range')
grid.set(xlabel='IncomeRange', ylabel='DebtToIncomeRatio');
ax1=plt.subplot(2,2,3)
grid = sns.boxplot(data=dfc, y='DebtToIncomeRatio', x='IncomeRange')
grid.set_ylim([0,2])
grid.tick_params(axis='x', labelrotation=90)
grid.set_title('Debt to Income Ratio by Income Range - Zoom DebtToIncomeRatio < 2')
grid.set(xlabel='IncomeRange', ylabel='DebtToIncomeRatio');
# What is the relationship between Terms of contract (Contract Length) and ProsperRating (Alpha)?
# We can see borrowers flaged with high risk (HR) are only awarded 36 months Term loans
# Also highest number of loans awarded are to borrowers with ProsperRating (Alpha) between A-E for 36 months Term.
# And the second highest number of loans awarded are to borrowers with ProsperRating (Alpha) between B-C for 60 months Term.
dfc_new = dfc_new.loc[:,~dfc_new.columns.duplicated()].copy()
tabular = dfc_new.pivot_table(values='MemberKey', index=['ProsperRating (Alpha)'], columns=['Term'], aggfunc='count', fill_value=0)
#tabular
grid = sns.heatmap(tabular, annot=True, cmap='Blues', fmt=',d')
grid.set_title('Count of Intersects between Term loans and ProsperRating (Alpha) levels');
# What is the relationship between Term and ProsperScore?
# Contrary to ProspectRating, we can see that ProspectScore levels having loans with all the three Terms (12, 36, 60).
# But similar to ProspectRating, ProspectScore have most loans awarded for middle tier levels.
tabular = dfc_new.pivot_table(values='MemberKey', index=['ProsperScore'], columns=['Term'], aggfunc='count', fill_value=0)
tabular
grid = sns.heatmap(tabular, annot=True, cmap='Blues', fmt=',d', linewidth=.5)
grid.set_title('Count of Intersects between Term loans and ProsperScore levels')
grid.set(xlabel="Term", ylabel="ProsperScore")
grid.xaxis.tick_bottom();
grid.invert_yaxis();
# Instead of checking pairs of data hoping to stumble upon meaningful relationships
# let's plot multiple pairwise bivariate distributions in the dataset to quickly glean hidden relationships between those pairs of variables.
# Relationships can be spotted between BorrowerRate vs LenderYield, BorrowerRate vs EstimatedReturn, BorrowerRate vs BorrowerAPR,
# Estimated Return vs LenderYield, BorrowerAPR vs vs LenderYield, CreditScoreRatingUpper vs CreditScoreRatingLower,
# borrowerRate vs EstimatedLoss, LenderYield vs Estimated Loss, and Estimated Return vs EstimatedEffectiveYield.
# a number of plots show the presence of bands of data points, indicating third variables affecting the duo relationship.
sns.pairplot(dfc, diag_kind="hist", corner=True);
A number of bivariate investigations were carried out on the dataset:
# let's check the average LenderYield for each category split by period pre/post Jul 2009
# post Jul 2009 data shows higher LenderYiled for the categories where there are pre and post Jul 2009 data except Debt Consolidation, where it was slightly less.
plt.figure(figsize=(10,5))
sns.set(font_scale=1.5, style="whitegrid")
grid = sns.barplot(data=dfc, x='ListingCategory', y='LenderYield', hue='period', estimator=np.mean) #np.count_nonzero # np.sum
grid.tick_params(axis='x', labelrotation=90)
grid.set_title('Mean LenderYield by ListingCategory - pre Jul 2009 vs post Jul 2009')
grid.set(xlabel='', ylabel='LenderYield');
sns.move_legend(grid, "upper left", bbox_to_anchor=(1, 1));
# lets see how LinderYield varies per category for pre and post Jul 2009 data sets in one plot
# The plot shows that that post-Jul 2009 loans have higher LinderYields compared to pre-Jul 2009 loans
plt.figure(figsize=(15,10))
sns.set_theme()
grid = sns.boxplot(data=dfc, y='LenderYield', x='ListingCategory', hue='period')
#grid.set_ylim([0,2])
grid.tick_params(axis='x', labelrotation=90)
grid.set_title('LinderYield by ListingCategory Boxplot')
grid.set(xlabel='ListingCategory', ylabel='LinderYield');
sns.move_legend(grid, "upper left", bbox_to_anchor=(1, 1));
# lets see how EstimatedReturn varies per category for pre and post Jul 2009 data sets in one plot
# This comparison was not useful because EstimatedReturn is not present for pre-Jul2009 dataset except for odd handful outlier datapoints!
plt.figure(figsize=(15,10))
sns.set_theme()
grid = sns.boxplot(data=dfc, y='EstimatedReturn', x='ListingCategory', hue='period')
#grid.set_ylim([0,2])
grid.tick_params(axis='x', labelrotation=90)
grid.set_title('EstimatedReturn by ListingCategory Boxplot')
grid.set(xlabel='ListingCategory', ylabel='EstimatedReturn');
sns.move_legend(grid, "upper left", bbox_to_anchor=(1, 1));
# proves above conclusion about pre-Jul2009 not having EstimatedReturn! and so above comparison plot did not prove to be that useful
dfc.groupby(['ListingCategory', 'period'])['LoanKey'].count()#.reset_index(name='count') #dfc['EstimatedReturn'].notnull()
# lets see how BorrowerRate varies per category for pre and post Jul 2009 data sets in one plot
# The plot shows that that post-Jul 2009 loans have higher BorrowerRate compared to pre-Jul 2009 loans. Which marries up with LenderYield trends plotted above.
plt.figure(figsize=(15,10))
grid = sns.boxplot(data=dfc, y='BorrowerRate', x='ListingCategory', hue='period')
#grid.set_ylim([0,2])
grid.tick_params(axis='x', labelrotation=90)
grid.set_title('BorrowerRate by ListingCategory Boxplot')
grid.set(xlabel='ListingCategory', ylabel='BorrowerRate');
sns.move_legend(grid, "upper left", bbox_to_anchor=(1, 1));
# How does EstimatedReturn and LenderYield distribution relate to each other for post Jul 2009 data vary according to Listing Category?
# this information is not available for pre Jul 2009
# We can see for all categories the EstimatedReturn is nearly centered around the 0.1 except for Debt Consolidation which is centered around 0.08.
# The plot shows that overall EstimatedReturn values are less than LenderYield values.
# Also, the spread of EstimatedReturn (0.02-1.8) is much narrower than LenderYield (0.05-0.36)
plt.figure(figsize = [20, 10])
palette = sns.color_palette("deep", 3)
sns.set(font_scale=1)
# plt.subplots_adjust(top = 0.5, bottom=0.01, hspace=.5, wspace=0.2) #control gap between subplots
grid = sns.FacetGrid(data=dfc_new, col='ListingCategory', col_wrap=4, margin_titles=True, height=3, aspect= 1.33)
line1 = grid.map(sns.histplot, 'EstimatedReturn', binwidth=0.01, color=palette[1], alpha = 0.7, label='EstimatedReturn'); # remove binwidth if you want auto generated different bin widths for each plot.
line2 = grid.map(sns.histplot, 'LenderYield', binwidth=0.01, color=palette[2], alpha=0.7, label='LenderYield');
grid.fig.subplots_adjust(top=0.95)
grid.fig.suptitle('Histogram Distribution of EstimatedReturn and LenderYield by Listingcategories - post Jul 2009!', fontsize=14);
handles, labels = grid.fig.gca().get_legend_handles_labels()
grid.fig.legend(handles=handles,labels=labels, title='Legend', loc='center right', bbox_to_anchor=(1.15, 0.5)); # add a facetgrid legend for the different grid.map plots.
# How does Lender Yield for post Jul 2009 vary according to Listing Category and Term?
# This plot confirm previous univariate analysis showing 36 months the domenant Term in the data. Also, it confirm that this is trend is present in all categories.
# Also, This plot shows that LenderYield is different for different Terms.
# Looking at the Debt Consolidation Listing Category, the 60 months Term Loans seem more normally distributed with a bell peak occuring around approx 0.16,
# Whereas the 36 months Loans have a stronger skewness to the right with the bell peak occuring around 0.13 and an unusual large peak of loans at around 0.3
# Also the 36 months Term Loans are more popular than the 12 months or the 60 months Term. the 12 months Term are the least popular and are hardly noticed in the plot.
# From this plot we can tell the large peak is associated with the 36 months Terms contract. but what for?
plt.figure(figsize = [20, 10])
# plt.subplots_adjust(top = 0.5, bottom=0.01, hspace=.5, wspace=0.2) #control gap between subplots
sns.set(font_scale=1) # change plots individual title size.
grid = sns.FacetGrid(data=dfc_new, col='ListingCategory', hue='Term', col_wrap=4, margin_titles=True, height=3, aspect= 1.33) # control facitgrid graph widths: margin_titles=True, height=3, aspect= 1.33
grid.map(sns.histplot, 'LenderYield', binwidth=0.01, alpha=0.5); # remove binwidth if you want auto generated different bin widths for each plot.
grid.fig.subplots_adjust(top=0.95)
grid.fig.suptitle('Histogram Distribution of LenderYield by ListingCategories - post Jul 2009!', fontsize=14);
grid.add_legend();
grouped = dfc.groupby(['period','ListingCategory'])['ListingKey'].count().reset_index().rename(columns={'ListingKey':'count'})
grouped['subTotal']=grouped.groupby(['period'])['count'].transform('sum')
grouped['percent'] = 100 * grouped['count']/grouped['subTotal']
pd.options.display.float_format = '{:.2f}'.format # does not affect number precision, but display only.
pd.reset_option('display.max_rows')
grouped#.sort_values(by='ListingCategory')#.head(30)
# How does LenderYield for post Jul 2009 vary according to LoanStatus and Term?
# We can see that current loans have the highest loans in the dataset, followed by completed status.
# Chargedoff have the 3rd highest presence in the dataset, whilest the Past Due Date have much lower number of loans.
# We still can see the high peak at around 0.3 present across all LoanStatus categories. So, it is not influenced by loanStatus. what is it related to, then?
bins = np.arange(0,0.4,0.01)
plt.figure(figsize = [20, 10])
sns.set(font_scale=1) # change plots individual title size.
grid = sns.FacetGrid(data=dfc_new, col='LoanStatus', hue='Term', col_wrap=4, margin_titles=True, height=3, aspect= 1.33)# control facitgrid graph widths: margin_titles=True, height=3, aspect= 1.33
grid.map(sns.histplot, 'LenderYield', bins=bins, alpha=0.5); # remove binwidth if you want auto generated different bin widths for each plot.
grid.fig.subplots_adjust(top=0.9)
grid.fig.suptitle('Histogram Distribution of LenderYield by LoanStatus - post Jul 2009!', fontsize=14);
grid.add_legend();
# How does LenderYield for post Jul 2009 vary according to ProsperScore and Term?
# We can see that the peak at 0.3 is present mainly in ProsperScore 1, 2, 3, 4, 5.
# This suggests that it is associated with lower score borrowers.
# How is this reflected in other borrower rating/scoring variables.
bins = np.arange(0,0.4,0.01)
plt.figure(figsize = [20, 10])
# plt.subplots_adjust(top = 0.5, bottom=0.01, hspace=.5, wspace=0.2) #control gap between subplots
sns.set(font_scale=1) # change plots individual title size.
grid = sns.FacetGrid(data=dfc_new, col='ProsperScore', hue='Term', col_wrap=4, margin_titles=True, height=3, aspect= 1.33)# control facitgrid graph widths: margin_titles=True, height=3, aspect= 1.33
grid.map(sns.histplot, 'LenderYield', bins=bins, alpha=0.5); # remove binwidth if you want auto generated different bin widths for each plot.
grid.fig.subplots_adjust(top=0.9)
grid.fig.suptitle('Histogram Distribution of LenderYield by ProsperScore - post Jul 2009!', fontsize=14);
grid.add_legend();
# How does LenderYield for post Jul 2009 vary according to ProsperRating (Alpha) and Term?
# The plots here revealed that the 0.3 high peak is very strongly associated with loans given to individuals who are identified High Risk (HR) and to a smaller degree to level E.
# Also, this plots shows that ProsperRating (Alpha) have a very strong influence on LenderYiled levels, where LenderYield increases as the ProsperRating (Alpha) decreases
bins = np.arange(0,0.4,0.01)
plt.figure(figsize = [20, 10])
# plt.subplots_adjust(top = 0.5, bottom=0.01, hspace=.5, wspace=0.2) #control gap between subplots
sns.set(font_scale=1) # change plots individual title size.
grid = sns.FacetGrid(data=dfc_new, col='ProsperRating (Alpha)', hue='Term', col_wrap=4, margin_titles=True, height=3, aspect= 1.33)# control facitgrid graph widths: margin_titles=True, height=3, aspect= 1.33
grid.map(sns.histplot, 'LenderYield', bins=bins, alpha=0.5); # remove binwidth if you want auto generated different bin widths for each plot.
grid.fig.subplots_adjust(top=0.9)
grid.fig.suptitle('Histogram Distribution of LenderYield by ProsperRating (Alpha) - post Jul 2009!', fontsize=14);
grid.add_legend();
# Given our finding above, How does LenderYield for pre Jul 2009 vary according to CreditGrade and Term?
# CreditGrade is for pre Jul 2009 which is similar to ProsperRating (Alpha)for post Jul 2009
# There seems to be only one Term (36 months) present in the pre Jul 2009 data.
# The plot shows that LenderYield increased on the x-axis (Histogram shift from left skewed to right skewed) as CreditGrade decreased going from AA to HR.
# The plots show that for pre Jul 2009 loans, there are two peaks; one at around 0.34-0.35 and another at around 0.28 which are present in the lower CreditGrades and HR individuals loans
# What could be causing those two peaks in the pre Jul 2009?
bins = np.arange(0,0.4,0.01)
plt.figure(figsize = [20, 10])
# plt.subplots_adjust(top = 0.5, bottom=0.01, hspace=.5, wspace=0.2) #control gap between subplots
sns.set(font_scale=1) # change plots individual title size.
grid = sns.FacetGrid(data=dfc_old, col='CreditGrade', hue='Term', col_wrap=4, margin_titles=True, height=3, aspect= 1.33)# control facitgrid graph widths: margin_titles=True, height=3, aspect= 1.33
grid.map(sns.histplot, 'LenderYield', bins=bins, alpha=0.5); # remove binwidth if you want auto generated different bin widths for each plot.
grid.fig.subplots_adjust(top=0.9)
grid.fig.suptitle('Histogram Distribution of LenderYield by CreditGrade - pre Jul 2009!', fontsize=14);
grid.add_legend();
# lets see how LinderYield varies per category for pre and post Jul 2009 data sets in one plot
# The plot shows that that post-Jul 2009 loans have higher LinderYields compared to pre-Jul 2009 loans
plt.figure(figsize=(20,5))
# Use white grid plot background from seaborn
#sns.set(font_scale=1.2, style="ticks")
# sns.set_theme() # reset to default
plt.subplot(1, 2, 1)
grid = sns.boxplot(data=dfc, y='LenderYield', x='Term', hue='Term')
grid.set_ylim([-0.2,0.5])
grid.tick_params(axis='x', labelrotation=90)
grid.set_title('LinderYield by Term Boxplot')
grid.set(xlabel='Term', ylabel='LinderYield');
sns.move_legend(grid, "upper left", bbox_to_anchor=(1, 1));
plt.subplot(1, 2, 2)
grid = sns.boxplot(data=dfc, y='EstimatedReturn', x='Term', hue='Term')
grid.set_ylim([-0.2,0.5])
grid.tick_params(axis='x', labelrotation=90)
grid.set_title('EstimatedReturn by Term Boxplot')
grid.set(xlabel='Term', ylabel='EstimatedReturn');
sns.move_legend(grid, "upper left", bbox_to_anchor=(1, 1));
# let's look at the Term distribution for loans
# We can see the gap in the middle splitting the data into two periods,
# This plot confirm the above finding, that pre Jul 2009 data has only 36 Term loans. 12 months and 60 months Term loans are introduced post Jul 2009.
plt.figure(figsize=(15,5))
# Use white grid plot background from seaborn
sns.set(font_scale=1.5, style="whitegrid")
# Create an array with the colors you want to use
colors = ['#fc8d62', '#66c2a5', '#8da0cb']
# Set your custom color palette
sns.set_palette(sns.color_palette(colors))
dfc['y_m'] = dfc['LoanDate'].dt.strftime('%Y-%m')
grid = sns.countplot(data=dfc.sort_values(by='y_m'), x='y_m', hue='Term')#, palette=palette
grid.tick_params(axis='x', labelrotation=90, labelsize=10)
grid.set_title("Loan Date Distribution by Term", fontsize=14)
grid.set_xlabel("Loan dates", fontsize=12)
grid.set_ylabel("Count of loans", fontsize=12)
sns.move_legend(grid, "upper left", bbox_to_anchor=(1, 1));
plt.show();
# # let's look further at the Term distribution for loans
# We can see that the 12 months term loans have been introduced at the same time as 60 months term loans Nov. 2010,
# but the 12 months term loans only present in the data until in Apr. 2013
# has the 12 months term loans been discontinued? or are there missing data?
plt.figure(figsize = [25, 10])
plt.subplots_adjust(top = 0.8, bottom=0.01, hspace=0.4, wspace=0.4) #control gap between subplots
palette = sns.color_palette("deep", 3) #Paired
dfc['y_m'] = dfc['LoanDate'].dt.strftime('%Y-%m')
ax3 = plt.subplot(2, 2, 3)
grid = sns.countplot(data=dfc[dfc['Term'] == 60].sort_values(by='y_m'), x='y_m', hue='Term')#, palette=palette
grid.tick_params(axis='x', labelrotation=90, labelsize=10)
grid.set_title("Loan Date Distribution by Term = 60 months", fontsize=14)
grid.set_xlabel("Loan dates", fontsize=12)
grid.set_ylabel("Count of loans", fontsize=12)
sns.move_legend(grid, "upper left", bbox_to_anchor=(0, 1));
ax1 = plt.subplot(2, 2, 1)
grid = sns.countplot(data=dfc[dfc['Term'] == 12].sort_values(by='y_m'), x='y_m', hue='Term')#, palette=palette
grid.tick_params(axis='x', labelrotation=90, labelsize=10)
grid.set_title("Loan Date Distribution by Term = 12 months", fontsize=14)
grid.set_xlabel("Loan dates", fontsize=12)
grid.set_ylabel("Count of loans", fontsize=12)
sns.move_legend(grid, "upper left", bbox_to_anchor=(0, 1));
ax1.sharex(ax3) # make 12 and 60 plots share x-axis to show start and end points
ax2 = plt.subplot(2, 2, 2)
grid = sns.countplot(data=dfc[dfc['Term'] == 36].sort_values(by='y_m'), x='y_m', hue='Term')#, palette=palette
grid.tick_params(axis='x', labelrotation=90, labelsize=10)
grid.set_title("Loan Date Distribution by Term = 36 months", fontsize=14)
grid.set_xlabel("Loan dates", fontsize=12)
grid.set_ylabel("Count of loans", fontsize=12)
sns.move_legend(grid, "upper left", bbox_to_anchor=(0, 1));
# let's plot the pre Jul 2009 data alone just to confirm again that it contains only 36 months loans
plt.figure(figsize=(15,5))
dfc_old['y_m'] = sorted(dfc_old['LoanDate'].map(lambda x: x.strftime('%Y-%m')))
grid = sns.countplot(data=dfc_old, x='y_m', hue='Term')
grid.tick_params(axis='x', labelrotation=90, labelsize=10)
grid.axes.set_title("Loan Date Distribution", fontsize=14)
grid.set_xlabel("Loan dates", fontsize=12)
grid.set_ylabel("Count of loans", fontsize=12)
sns.move_legend(grid, "upper left", bbox_to_anchor=(0, 1));
plt.show();
# obtain the colours for the above plot, so that to make 12 months loans noticable with a bright colour.
print(sns.color_palette("Set2", 3).as_hex())
sns.color_palette("Set2", 3).as_hex()
# Given our finding above, How does LenderYield for pre Jul 2009 vary according to CreditGrade and CreditScoreRangeLower?
# The plots shows the general trend of lower CreditScoreRangeLower values are associated with the lower CreditGrade values and visa versa, which
# cause LendeYield to increase on the x-axis when moving from the highest CreditGrade AA to the next lower one. This can be observed by observing CreditGrade
# AA having darker Blue colour bars then gradually fainting out as CreditGrade decreases, and the finally moving to stronger tones of Orange Colour as we reach the bottom
# end of the scale.
# But this plot does not explain the two peaks at 0.28 and 3.4
bins = np.arange(0,0.4,0.01)
plt.figure(figsize = [20, 10])
# plt.subplots_adjust(top = 0.5, bottom=0.01, hspace=.5, wspace=0.2) #control gap between subplots
sns.set(font_scale=1) # change plots individual title size.
palette = sns.color_palette('coolwarm_r',26) # use reverse palette to keep lower values marked with orangey/redish colour
grid = sns.FacetGrid(data=dfc_old, col='CreditGrade', hue='CreditScoreRangeLower', col_wrap=4, margin_titles=True, height=3, aspect= 1.33, palette=palette)# control facitgrid graph widths: margin_titles=True, height=3, aspect= 1.33
grid.map(sns.histplot, 'LenderYield', bins=bins, alpha=0.5); # remove binwidth if you want auto generated different bin widths for each plot.
grid.fig.subplots_adjust(top=0.9)
grid.fig.suptitle('Histogram Distribution of LenderYield by CreditGrade and CreditScoreRangeLower - pre Jul 2009!', fontsize=14);
grid.add_legend();
dfc_old['Term'].value_counts()
# Given our finding above, How does LenderYield for pre Jul 2009 vary according to LisitingCagtegory and CreditScoreRangeLower?
# The peak at 0.28 is present in the "Not Available" ListingCategory, whereas the 0.34 is present across all categories.
# The peak is present across multiple values in the bottom half of the CreditScoreRangeLower values
# We still don't have a definitive explaination for the o.28 and 0.34 peaks.
# But we can see an over all trend where the LenderYield increased as the CreditScoreRangeLower values decreased in the scale, for each category, which is
# observed in the gradual shift from darker blue on the very left hand side to darker orange on the very right hand side.
palette = sns.color_palette("coolwarm_r", 26) # use reverse palette to keep lower values marked with orangey/redish colour
bins = np.arange(0,0.4,0.01)
plt.figure(figsize = [20, 10])
# plt.subplots_adjust(top = 0.5, bottom=0.01, hspace=.5, wspace=0.2) #control gap between subplots
sns.set(font_scale=1) # change plots individual title size.
grid = sns.FacetGrid(data=dfc_old, col='ListingCategory', hue='CreditScoreRangeLower', col_wrap=4, margin_titles=True, height=3, aspect= 1.33, palette=palette)
grid.map(sns.histplot, 'LenderYield', bins=bins, alpha=0.5); # remove binwidth if you want auto generated different bin widths for each plot.
grid.fig.subplots_adjust(top=0.9)
grid.fig.suptitle('Histogram Distribution of LenderYield by Listingcategories and CreditScoreRangeLower - pre Jul 2009!', fontsize=14);
grid.add_legend();
# Given our finding above, How does LenderYield for pre Jul 2009 vary according to ListingCategory and CreditGrade?
# The very same trend of LenderYield increasing as CreditGrade decreased is noticable again, but we can see much greater correlation between
# LenderYield increasing as CreditGrade decreased in the scale than CreditScoreRangeLower. This is observed in the Blue occupying the left hand side and then
# moving in the colour scale towards Orange on the right hand side. This times showing a much more segemented plot between Blue and Orange than CreditScoreRangeLower.
# The plots shows the two peaks 0.28 and 0.34 associated with the lower CreditGrade values E, HR and also NC
# We still don't have a clear cut answer as to the difference between those two peaks and what determines 0.28 and 0.34 LenderYield values for loans.
# But this is good enough for this analysis. This route will not been followed any further.
palette = sns.color_palette("coolwarm", 8) # use coolwarm palette to keep lower values marked with orangey/redish colour
bins = np.arange(0,0.4,0.01)
plt.figure(figsize = [20, 10])
# plt.subplots_adjust(top = 0.5, bottom=0.01, hspace=.5, wspace=0.2) #control gap between subplots
sns.set(font_scale=1) # change plots individual title size.
grid = sns.FacetGrid(data=dfc_old, col='ListingCategory', hue='CreditGrade', col_wrap=4, margin_titles=True, height=3, aspect= 1.33, palette = palette)# control facitgrid graph widths: margin_titles=True, height=3, aspect= 1.33
grid.map(sns.histplot, 'LenderYield', bins=bins, alpha=0.5); # remove binwidth if you want auto generated different bin widths for each plot.
grid.fig.subplots_adjust(top=0.9)
grid.fig.suptitle('Histogram Distribution of LenderYield by Listingcategories and CreditGrade - pre Jul 2009!', fontsize=14);
grid.add_legend();
# inspecting actual number for CreditGrade
dfc_old['CreditGrade'].value_counts()
# Given our finding above, let's finally look at how does LenderYield for post Jul 2009 vary according to ListingCategory and 'ProsperRating (Alpha)'?
# The very same trend of LenderYield increasing as CreditGrade or CreditScoreRangeLower decreased in pre Jul 2009 data is noticable again here.
# LenderYield increases as ProsperRating (Alpha) decreased in the scale. This is observed in the Blue occupying the left hand side and then
# moving in the colour scale towards Orange on the right hand side. This times showing a much more segemented plot between Blue and Orange than previous plots.
# The plots shows again the peak at 0.3 associated with the lower CrediProsperRating (Alpha) rating values HR and E.
palette = sns.color_palette("coolwarm", 7) # use coolwarm palette to keep lower values marked with orangey/redish colour
bins = np.arange(0,0.4,0.01)
plt.figure(figsize = [20, 10])
# plt.subplots_adjust(top = 0.5, bottom=0.01, hspace=.5, wspace=0.2) #control gap between subplots
sns.set(font_scale=1) # change plots individual title size.
grid = sns.FacetGrid(data=dfc_new, col='ListingCategory', hue='ProsperRating (Alpha)', col_wrap=4, margin_titles=True, height=3, aspect= 1.33, palette = palette)# control facitgrid graph widths: margin_titles=True, height=3, aspect= 1.33
grid.map(sns.histplot, 'LenderYield', bins=bins, alpha=0.5); # remove binwidth if you want auto generated different bin widths for each plot.
grid.fig.subplots_adjust(top=0.9)
grid.fig.suptitle('Histogram Distribution of LenderYield by Listingcategories and ProsperRating (Alpha) - post Jul 2009!', fontsize=14);
grid.add_legend();
# inspecting actual number for Term
dfc['Term'].value_counts()
# inspecting actual number for PercentFunded
df['PercentFunded'].value_counts()
# What could be causing the different bands in some of the duo relationships?
# Which thrid variable is responsible for this banding?
# Those bands can be observed in the BorrowerRate vs EstimatedLoss, BorrowerRate vs EstimatedEffectiveYield, BorrowerRate vs EstimatedReturn
# The relahionship between borrowerRate and EstimateReturn, EstimatedEffectiveYield, and EstimatedLoss shows groups of data points
# What are those groups of data related to?
# lets try to segement the data by Loan Term, which may help explain the different data groupings in the plot.
# The plots shows that some of the variance observed can be explained by the loan Term,
# The 12 and 60 months Loans have a narrow almost linear spread of points,
# Whilest the 36 months Term loans have a wide spread with Multiple sub groups of data points.
# The 36 months loans seem to be responsible for the highest loss, and lowest EstimatedReturn and EstimatedEffectiveYield data points compared to the 12 and 60 months Term loans!
# But what are those subgroups? and what other variables can explain the reminder of the variance?
plt.figure(figsize = [25, 10])
plt.subplots_adjust(top = 0.8, bottom=0.01, hspace=0.4, wspace=0.4) #control gap between subplots
palette = sns.color_palette("deep", 3) #Paired
# Use white grid plot background from seaborn
sns.set(font_scale=1.2, style="ticks")
plt.subplot(2, 2, 1)
#dfc_new['ListingCategory'].value_counts()
grid = sns.scatterplot(data=dfc_new[dfc_new['BorrowerRate'].notna()], x='BorrowerRate', y='LenderYield', hue = 'Term', palette=palette) # , style='LoanStatus'
grid.tick_params(axis='x', labelrotation=90)
grid.set_title('BorrowerAPR vs LenderYield by Term - post Jul 2009')
grid.set(xlabel='BorrowerRate', ylabel='LenderYield');
sns.move_legend(grid, "upper left", bbox_to_anchor=(1, 1));
plt.subplot(2, 2, 2)
#dfc_new['ListingCategory'].value_counts()
grid = sns.scatterplot(data=dfc_new[dfc_new['BorrowerRate'].notna()], x='BorrowerRate', y='EstimatedLoss', hue = 'Term', palette=palette) # , style='LoanStatus'
grid.tick_params(axis='x', labelrotation=90)
grid.set_title('BorrowerAPR vs EstimatedLoss by Term - post Jul 2009')
grid.set(xlabel='BorrowerRate', ylabel='EstimatedLoss');
sns.move_legend(grid, "upper left", bbox_to_anchor=(1, 1));
plt.subplot(2, 2, 3)
#dfc_new['ListingCategory'].value_counts()
grid = sns.scatterplot(data=dfc_new[dfc_new['BorrowerRate'].notna()], x='BorrowerRate', y='EstimatedEffectiveYield', hue = 'Term', palette=palette)
grid.tick_params(axis='x', labelrotation=90)
grid.set_title('BorrowerAPR vs EstimatedEffectiveYield by Term - post Jul 2009')
grid.set(xlabel='BorrowerRate', ylabel='EstimatedEffectiveYield');
sns.move_legend(grid, "upper left", bbox_to_anchor=(1, 1));
plt.subplot(2, 2, 4)
#dfc_new['ListingCategory'].value_counts()
grid = sns.scatterplot(data=dfc_new[dfc_new['BorrowerRate'].notna()], x='BorrowerRate', y='EstimatedReturn', hue = 'Term', palette=palette)
grid.tick_params(axis='x', labelrotation=90)
grid.set_title('BorrowerAPR vs EstimatedReturn by Term - post Jul 2009')
grid.set(xlabel='BorrowerRate', ylabel='EstimatedReturn');
sns.move_legend(grid, "upper left", bbox_to_anchor=(1, 1));
# The relahionship between borrowerRate and EstimateReturn, EstimatedEffectiveYield, and EstimatedLoss shows groups of data points
# What are those groups of data related to?
# lets try to segement the data by ProsperRating (Alpha), which may help explain the different data groupings in the plot too.
# The plot shows that the different ProsperRating (Alpha) ratings explain the different bands observed in our data.
# Also the plots showed that higher ProsperRating (Alpha) has lower EstimatedLoss, higher EstimatedEffectiveYield, and higher EstimatedReturn.
# Also, As the ProsperRating (Alpha) decreased to lower ratings, the BorrowerRate and EstimatedLoss moved to higher rates, whilest EstimatedEeffectiveYield
# and EstimatedReturn moved to lower lower rates.
# The plot still shows the presence of sub layers of groupings within those groups that manifests itself by the adjacent parrellel lines within the same colour.
# What other variables could be contributing to the presence of those sub groups?
plt.figure(figsize = [25, 15])
plt.subplots_adjust(top = 0.8, bottom=0.01, hspace=0.3, wspace=0.2) #control gap between subplots
palette = sns.color_palette("Paired", 7)
# Use white grid plot background from seaborn
sns.set(font_scale=1.2, style="ticks")
plt.subplot(2, 2, 1)
#dfc_new['ListingCategory'].value_counts()
grid = sns.scatterplot(data=dfc_new[dfc_new['BorrowerRate'].notna()], x='BorrowerRate', y='LenderYield', hue = 'ProsperRating (Alpha)', palette=palette)
grid.tick_params(axis='x', labelrotation=90)
grid.set_title('BorrowerRate vs LenderYield - post Jul 2009')
grid.set(xlabel='BorrowerRate', ylabel='LenderYield');
box = grid.get_position()
grid.set_position([box.x0, box.y0 + box.height * 0.1,
box.width, box.height * 0.9])
# Put a legend below current axis
grid.legend(loc='upper center', bbox_to_anchor=(0.5, -0.15),
fancybox=True, shadow=True, ncol=8, title='ProsperRating (Alpha)')
plt.subplot(2, 2, 2)
#dfc_new['ListingCategory'].value_counts()
grid = sns.scatterplot(data=dfc_new[dfc_new['BorrowerRate'].notna()], x='BorrowerRate', y='EstimatedLoss', hue = 'ProsperRating (Alpha)', palette=palette)
grid.tick_params(axis='x', labelrotation=90)
grid.set_title('BorrowerRate vs EstimatedLoss - post Jul 2009')
grid.set(xlabel='BorrowerRate', ylabel='EstimatedLoss');
box = grid.get_position()
grid.set_position([box.x0, box.y0 + box.height * 0.1,
box.width, box.height * 0.9])
# Put a legend below current axis
grid.legend(loc='upper center', bbox_to_anchor=(0.5, -0.15),
fancybox=True, shadow=True, ncol=8, title='ProsperRating (Alpha)')
plt.subplot(2, 2, 3)
grid = sns.scatterplot(data=dfc_new[dfc_new['BorrowerRate'].notna()], x='BorrowerRate', y='EstimatedEffectiveYield', hue = 'ProsperRating (Alpha)', palette=palette)
grid.tick_params(axis='x', labelrotation=90)
grid.set_title('BorrowerRate vs EstimatedEffectiveYield - post Jul 2009')
grid.set(xlabel='BorrowerRate', ylabel='EstimatedEffectiveYield');
box = grid.get_position()
grid.set_position([box.x0, box.y0 + box.height * 0.1,
box.width, box.height * 0.9])
# Put a legend below current axis
grid.legend(loc='upper center', bbox_to_anchor=(0.5, -0.15),
fancybox=True, shadow=True, ncol=8, title='ProsperRating (Alpha)')
plt.subplot(2, 2, 4)
grid = sns.scatterplot(data=dfc_new[dfc_new['BorrowerRate'].notna()], x='BorrowerRate', y='EstimatedReturn', hue = 'ProsperRating (Alpha)', palette=palette)
grid.tick_params(axis='x', labelrotation=90)
grid.set_title('BorrowerRate vs EstimatedReturn - post Jul 2009')
grid.set(xlabel='BorrowerRate', ylabel='EstimatedReturn');
box = grid.get_position()
grid.set_position([box.x0, box.y0 + box.height * 0.1,
box.width, box.height * 0.9])
# Put a legend below current axis
grid.legend(loc='upper center', bbox_to_anchor=(0.5, -0.15),
fancybox=True, shadow=True, ncol=8, title='ProsperRating (Alpha)'); # increase ncol by 1 for title to show
plt.show();
dfc_new.groupby(['LoanStatus'])['LoanKey'].count()#.keys()
statuses = ['Completed', 'Current', 'Defaulted', 'Chargedoff',
'Cancelled', 'FinalPaymentInProgress',
'Past Due (1-15 days)', 'Past Due (16-30 days)',
'Past Due (31-60 days)', 'Past Due (61-90 days)',
'Past Due (91-120 days)', 'Past Due (>120 days)']
# The relahionship between borrowerRate and EstimateReturn, EstimatedEffectiveYield, and EstimatedLoss shows groups of data points
# What are those groups of data related to?
# lets try to segement the data by ProsperRating (Alpha) for LoanStatus = 'Current'
# What stands out is that Current loans have a linear relationship between the duo variables in the plots and the horizontal bands of data are removed by the Current loans filter.
# This suggests a different algorithm for calculating Yield, Loss, and Return is used for more recent loans vs historical loans or some type of loans are stopped in more recent loan algo changes.
# The plot demonistrates that Current Loans still constitute all levels of ProsperRating (Alpha) inlcuding High Risk (HR) loans.
# In other words; lower ProsperRatings and HR loans are not stopped in more recent changes.
plt.figure(figsize = [25, 15])
plt.subplots_adjust(top = 0.8, bottom=0.01, hspace=0.3, wspace=0.2) #control gap between subplots
palette = sns.color_palette("Paired", 7)
# Use white grid plot background from seaborn
sns.set(font_scale=1.2, style="ticks")
plt.subplot(2, 2, 1)
grid = sns.scatterplot(data=dfc_new[dfc_new['BorrowerRate'].notna() & (dfc_new['LoanStatus'] == 'Current')], x='BorrowerRate', y='LenderYield', hue = 'ProsperRating (Alpha)', palette=palette)
grid.tick_params(axis='x', labelrotation=90)
grid.set_title('BorrowerRate vs LenderYield - Current Loans - post Jul 2009')
grid.set(xlabel='BorrowerRate', ylabel='LenderYield');
# Shrink current axis's height by 10% on the bottom
box = grid.get_position()
grid.set_position([box.x0, box.y0 + box.height * 0.1,
box.width, box.height * 0.9])
# Put a legend below current axis
grid.legend(loc='upper center', bbox_to_anchor=(0.5, -0.15),
fancybox=True, shadow=True, ncol=8, title='ProsperRating (Alpha)')
plt.subplot(2, 2, 2)
grid = sns.scatterplot(data=dfc_new[dfc_new['BorrowerRate'].notna() & (dfc_new['LoanStatus'] == 'Current')], x='BorrowerRate', y='EstimatedLoss', hue = 'ProsperRating (Alpha)', palette=palette)
grid.tick_params(axis='x', labelrotation=90)
grid.set_title('BorrowerRate vs EstimatedLoss - Current Loans - post Jul 2009')
grid.set(xlabel='BorrowerRate', ylabel='EstimatedLoss');
# Shrink current axis's height by 10% on the bottom
box = grid.get_position()
grid.set_position([box.x0, box.y0 + box.height * 0.1,
box.width, box.height * 0.9])
# Put a legend below current axis
grid.legend(loc='upper center', bbox_to_anchor=(0.5, -0.15),
fancybox=True, shadow=True, ncol=8, title='ProsperRating (Alpha)')
plt.subplot(2, 2, 3)
grid = sns.scatterplot(data=dfc_new[dfc_new['BorrowerRate'].notna() & (dfc_new['LoanStatus'] == 'Current')], x='BorrowerRate', y='EstimatedEffectiveYield', hue = 'ProsperRating (Alpha)', palette=palette)#, markers=m1, style='ProsperScore' # ,x_jitter=4, y_jitter=4 dont work
grid.tick_params(axis='x', labelrotation=90)
grid.set_title('BorrowerRate vs EstimatedEffectiveYield - Current Loans - post Jul 2009')
grid.set(xlabel='BorrowerRate', ylabel='EstimatedEffectiveYield');
box = grid.get_position()
grid.set_position([box.x0, box.y0 + box.height * 0.1,
box.width, box.height * 0.9])
# Put a legend below current axis
grid.legend(loc='upper center', bbox_to_anchor=(0.5, -0.15),
fancybox=True, shadow=True, ncol=8, title='ProsperRating (Alpha)')
plt.subplot(2, 2, 4)
grid = sns.scatterplot(data=dfc_new[dfc_new['BorrowerRate'].notna() & dfc_new['EstimatedReturn'].notna() & (dfc_new['LoanStatus'] == 'Current')], x='BorrowerRate', y='EstimatedReturn', hue = 'ProsperRating (Alpha)', palette=palette)#, markers=m1, style='ProsperScore'
grid.tick_params(axis='x', labelrotation=90)
grid.set_title('BorrowerRate vs EstimatedReturn - Current Loans - post Jul 2009')
grid.set(xlabel='BorrowerRate', ylabel='EstimatedReturn');
box = grid.get_position()
grid.set_position([box.x0, box.y0 + box.height * 0.1,
box.width, box.height * 0.9])
# Put a legend below current axis
grid.legend(loc='upper center', bbox_to_anchor=(0.5, -0.15),
fancybox=True, shadow=True, ncol=8, title='ProsperRating (Alpha)'); # increase ncol by 1 for title to show
plt.show();
# The relahionship between borrowerRate and EstimateReturn, EstimatedEffectiveYield, and EstimatedLoss shows groups of data points
# What are those groups of data related to?
# lets try to segement the data by ProsperRating (Alpha) for LoanStatus != 'Current'
# The plot demonistrates that Loans
plt.figure(figsize = [25, 15])
plt.subplots_adjust(top = 0.8, bottom=0.01, hspace=0.3, wspace=0.2) #control gap between subplots
palette = sns.color_palette("Paired", 7)
# Use white grid plot background from seaborn
sns.set(font_scale=1.2, style="ticks")
plt.subplot(2, 2, 1)
grid = sns.scatterplot(data=dfc_new[dfc_new['BorrowerRate'].notna() & (dfc_new['LoanStatus'] != 'Current')], x='BorrowerRate', y='LenderYield', hue = 'ProsperRating (Alpha)', palette=palette)
grid.tick_params(axis='x', labelrotation=90)
grid.set_title('BorrowerRate vs LenderYield - None Current Loans - post Jul 2009')
grid.set(xlabel='BorrowerRate', ylabel='LenderYield');
# Shrink current axis's height by 10% on the bottom
box = grid.get_position()
grid.set_position([box.x0, box.y0 + box.height * 0.1,
box.width, box.height * 0.9])
# Put a legend below current axis
grid.legend(loc='upper center', bbox_to_anchor=(0.5, -0.15),
fancybox=True, shadow=True, ncol=8, title='ProsperRating (Alpha)')
plt.subplot(2, 2, 2)
grid = sns.scatterplot(data=dfc_new[dfc_new['BorrowerRate'].notna() & (dfc_new['LoanStatus'] != 'Current')], x='BorrowerRate', y='EstimatedLoss', hue = 'ProsperRating (Alpha)', palette=palette)
grid.tick_params(axis='x', labelrotation=90)
grid.set_title('BorrowerRate vs EstimatedLoss - None Current Loans - post Jul 2009')
grid.set(xlabel='BorrowerRate', ylabel='EstimatedLoss');
# Shrink current axis's height by 10% on the bottom
box = grid.get_position()
grid.set_position([box.x0, box.y0 + box.height * 0.1,
box.width, box.height * 0.9])
# Put a legend below current axis
grid.legend(loc='upper center', bbox_to_anchor=(0.5, -0.15),
fancybox=True, shadow=True, ncol=8, title='ProsperRating (Alpha)')
plt.subplot(2, 2, 3)
grid = sns.scatterplot(data=dfc_new[dfc_new['BorrowerRate'].notna() & (dfc_new['LoanStatus'] != 'Current')], x='BorrowerRate', y='EstimatedEffectiveYield', hue = 'ProsperRating (Alpha)', palette=palette)
grid.tick_params(axis='x', labelrotation=90)
grid.set_title('BorrowerRate vs EstimatedEffectiveYield - None Current Loans - post Jul 2009')
grid.set(xlabel='BorrowerRate', ylabel='EstimatedEffectiveYield');
box = grid.get_position()
grid.set_position([box.x0, box.y0 + box.height * 0.1,
box.width, box.height * 0.9])
# Put a legend below current axis
grid.legend(loc='upper center', bbox_to_anchor=(0.5, -0.15),
fancybox=True, shadow=True, ncol=8, title='ProsperRating (Alpha)')
plt.subplot(2, 2, 4)
grid = sns.scatterplot(data=dfc_new[dfc_new['BorrowerRate'].notna() & dfc_new['EstimatedReturn'].notna() & (dfc_new['LoanStatus'] != 'Current')], x='BorrowerRate', y='EstimatedReturn', hue = 'ProsperRating (Alpha)', palette=palette)
grid.tick_params(axis='x', labelrotation=90)
grid.set_title('BorrowerRate vs EstimatedReturn - None Current Loans - post Jul 2009')
grid.set(xlabel='BorrowerRate', ylabel='EstimatedReturn');
box = grid.get_position()
grid.set_position([box.x0, box.y0 + box.height * 0.1,
box.width, box.height * 0.9])
# Put a legend below current axis
grid.legend(loc='upper center', bbox_to_anchor=(0.5, -0.15),
fancybox=True, shadow=True, ncol=8, title='ProsperRating (Alpha)'); # increase ncol by 1 for title to show
plt.show();
import random
def jitter(x):
#print(random.uniform(0, 0.02) -0.01)
return x + random.uniform(0, 0.02) -.01
# adding some jitter to the scatter plot manually! because ,x_jitter=4, y_jitter=4 in sns.scatterplot dont work
dfc_new['EstimatedLoss_jitter'] = dfc_new['EstimatedLoss'].apply(lambda x: jitter(x))
dfc_new['EstimatedReturn_jitter'] = dfc_new['EstimatedReturn'].apply(lambda x: jitter(x))
dfc_new['EstimatedEffectiveYield_jitter'] = dfc_new['EstimatedEffectiveYield'].apply(lambda x: jitter(x))
dfc_new['LenderYield_jitter'] = dfc_new['LenderYield'].apply(lambda x: jitter(x))
# The relahionship between borrowerRate and EstimateReturn, EstimatedEffectiveYield, and EstimatedLoss shows groups of data points
# What are those groups of data related to?
# lets try to segement the data by Loan Status to see all loanStatuses in one plot, instead of doing them one by one like previous two plots.
# The plots shows that current loans are associated with lower Estimatedloss, and higher EstimatedReturn and EstimatedEffectiveYield compared to other statuses.
# This may indicate a change in those metrics calculation or the way loans are allocated.
from matplotlib import cm
import random
palette = sns.color_palette("terrain", 12)
plt.figure(figsize = [25, 10])
plt.subplots_adjust(top = 0.8, bottom=0.01, hspace=0.4, wspace=0.4) #control gap between subplots
# Use white grid plot background from seaborn
sns.set(font_scale=1.2, style="ticks")
plt.subplot(2, 2, 1)
grid = sns.scatterplot(data=dfc_new[dfc_new['BorrowerRate'].notna()], x='BorrowerRate', y='LenderYield_jitter', hue = 'LoanStatus', palette=palette) # , style='LoanStatus'
grid.tick_params(axis='x', labelrotation=90)
grid.set_title('BorrowerAPR vs LenderYield by LoanStatus - post Jul 2009')
grid.set(xlabel='BorrowerRate', ylabel='LenderYield');
sns.move_legend(grid, "upper left", bbox_to_anchor=(1, 1));
plt.subplot(2, 2, 2)
grid = sns.scatterplot(data=dfc_new[dfc_new['BorrowerRate'].notna()], x='BorrowerRate', y='EstimatedLoss_jitter', hue = 'LoanStatus', palette=palette) # , style='LoanStatus'
grid.tick_params(axis='x', labelrotation=90)
grid.set_title('BorrowerAPR vs EstimatedLoss by LoanStatus - post Jul 2009')
grid.set(xlabel='BorrowerRate', ylabel='EstimatedLoss');
sns.move_legend(grid, "upper left", bbox_to_anchor=(1, 1));
plt.subplot(2, 2, 3)
grid = sns.scatterplot(data=dfc_new[dfc_new['BorrowerRate'].notna()], x='BorrowerRate', y='EstimatedEffectiveYield_jitter', hue = 'LoanStatus', palette=palette)
grid.tick_params(axis='x', labelrotation=90)
grid.set_title('BorrowerAPR vs EstimatedEffectiveYield by LoanStatus - post Jul 2009')
grid.set(xlabel='BorrowerRate', ylabel='EstimatedEffectiveYield');
sns.move_legend(grid, "upper left", bbox_to_anchor=(1, 1));
plt.subplot(2, 2, 4)
grid = sns.scatterplot(data=dfc_new[dfc_new['BorrowerRate'].notna()], x='BorrowerRate', y='EstimatedReturn_jitter', hue = 'LoanStatus', palette=palette)
grid.tick_params(axis='x', labelrotation=90)
grid.set_title('BorrowerAPR vs EstimatedReturn by LoanStatus - post Jul 2009')
grid.set(xlabel='BorrowerRate', ylabel='EstimatedReturn');
sns.move_legend(grid, "upper left", bbox_to_anchor=(1, 1));
# The relahionship between borrowerRate and EstimateReturn, EstimatedEffectiveYield, and EstimatedLoss shows groups of data points
# What are those groups of data related to?
# lets try to segement the data by LoanStatus = 'Current' and Term to see what differences are there between the different Terms for current loans.
# It is interesting to see that 12, 36, and 60 months Terms translate in the same (equivalent) linear fashion, however we can see that EstimatedLoss is highest for 12 months Term,
# then a bit lower for 36 months Term, then lowest for 60 months Term, which Translates for better profitability for 60 months Term, then 36 months Term, and lowest profitability
# to the 12 months Term.
from matplotlib import cm
import random
palette = sns.color_palette("Paired", 3)
#palette = cm.rainbow(np.linspace(0, 1, 13))
# Use white grid plot background from seaborn
sns.set(font_scale=1.2, style="ticks")
plt.figure(figsize = [25, 10])
plt.subplots_adjust(top = 0.8, bottom=0.01, hspace=0.4, wspace=0.4) #control gap between subplots
plt.subplot(2, 2, 1)
grid = sns.scatterplot(data=dfc_new[dfc_new['BorrowerRate'].notna() & (dfc_new['LoanStatus'] == 'Current')], x='BorrowerRate', y='LenderYield_jitter', hue='Term', palette=palette) # , style='LoanStatus' ListingCategory
grid.tick_params(axis='x', labelrotation=90)
grid.set_title('BorrowerAPR vs LenderYield by LoanStatus - Current loans - post Jul 2009')
grid.set(xlabel='BorrowerRate', ylabel='LenderYield');
sns.move_legend(grid, "upper left", bbox_to_anchor=(1, 1));
plt.subplot(2, 2, 2)
grid = sns.scatterplot(data=dfc_new[dfc_new['BorrowerRate'].notna() & (dfc_new['LoanStatus'] == 'Current')], x='BorrowerRate', y='EstimatedLoss_jitter', hue='Term', palette=palette) # , style='LoanStatus' ListingCategory
grid.tick_params(axis='x', labelrotation=90)
grid.set_title('BorrowerAPR vs EstimatedLoss by LoanStatus - Current loans - post Jul 2009')
grid.set(xlabel='BorrowerRate', ylabel='EstimatedLoss');
sns.move_legend(grid, "upper left", bbox_to_anchor=(1, 1));
plt.subplot(2, 2, 3)
grid = sns.scatterplot(data=dfc_new[dfc_new['BorrowerRate'].notna() & (dfc_new['LoanStatus'] == 'Current')], x='BorrowerRate', y='EstimatedEffectiveYield_jitter', hue='Term', palette=palette)
grid.tick_params(axis='x', labelrotation=90)
grid.set_title('BorrowerAPR vs EstimatedEffectiveYield by LoanStatus - Current loans - post Jul 2009')
grid.set(xlabel='BorrowerRate', ylabel='EstimatedEffectiveYield');
sns.move_legend(grid, "upper left", bbox_to_anchor=(1, 1));
plt.subplot(2, 2, 4)
grid = sns.scatterplot(data=dfc_new[dfc_new['BorrowerRate'].notna() & (dfc_new['LoanStatus'] == 'Current')], x='BorrowerRate', y='EstimatedReturn_jitter', hue='Term', palette=palette)
grid.tick_params(axis='x', labelrotation=90)
grid.set_title('BorrowerAPR vs EstimatedReturn by LoanStatus - Current loans - post Jul 2009')
grid.set(xlabel='BorrowerRate', ylabel='EstimatedReturn');
sns.move_legend(grid, "upper left", bbox_to_anchor=(1, 1));
# The relahionship between borrowerRate and EstimateReturn, EstimatedEffectiveYield, and EstimatedLoss shows groups of data points
# What are those groups of data related to?
# lets try to segement the data by LoanStatus != 'Current' and Term to see what differences are there between the different Terms for current loans.
# We can see those None current loans having similar picture to the above for both the 12 months and 60 months loans, whereas the 36 months loans taking
# a wider range of loss and Return thus occupying the least and most profitable loans in the dataset.
from matplotlib import cm
import random
palette = sns.color_palette("Paired", 3) # LoanStatus 12 #ListingCategory 20
#palette = cm.rainbow(np.linspace(0, 1, 13))
# Use white grid plot background from seaborn
sns.set(font_scale=1.2, style="ticks")
plt.figure(figsize = [25, 10])
plt.subplots_adjust(top = 0.8, bottom=0.01, hspace=0.4, wspace=0.4) #control gap between subplots
plt.subplot(2, 2, 1)
#dfc_new['ListingCategory'].value_counts()
grid = sns.scatterplot(data=dfc_new[dfc_new['BorrowerRate'].notna() & (dfc_new['LoanStatus'] != 'Current')], x='BorrowerRate', y='LenderYield_jitter', hue = 'Term', palette=palette) # , style='LoanStatus'
grid.tick_params(axis='x', labelrotation=90)
grid.set_title('BorrowerAPR vs LenderYield by LoanStatus - None Current loans - post Jul 2009')
grid.set(xlabel='BorrowerRate', ylabel='LenderYield');
sns.move_legend(grid, "upper left", bbox_to_anchor=(1, 1));
plt.subplot(2, 2, 2)
#dfc_new['ListingCategory'].value_counts()
grid = sns.scatterplot(data=dfc_new[dfc_new['BorrowerRate'].notna() & (dfc_new['LoanStatus'] != 'Current')], x='BorrowerRate', y='EstimatedLoss_jitter', hue = 'Term', palette=palette) # , style='LoanStatus'
grid.tick_params(axis='x', labelrotation=90)
grid.set_title('BorrowerAPR vs EstimatedLoss by LoanStatus - None Current loans - post Jul 2009')
grid.set(xlabel='BorrowerRate', ylabel='EstimatedLoss');
sns.move_legend(grid, "upper left", bbox_to_anchor=(1, 1));
plt.subplot(2, 2, 3)
#dfc_new['ListingCategory'].value_counts()
grid = sns.scatterplot(data=dfc_new[dfc_new['BorrowerRate'].notna() & (dfc_new['LoanStatus'] != 'Current')], x='BorrowerRate', y='EstimatedEffectiveYield_jitter', hue = 'Term', palette=palette)
grid.tick_params(axis='x', labelrotation=90)
grid.set_title('BorrowerAPR vs EstimatedEffectiveYield by LoanStatus - None Current loans - post Jul 2009')
grid.set(xlabel='BorrowerRate', ylabel='EstimatedEffectiveYield');
sns.move_legend(grid, "upper left", bbox_to_anchor=(1, 1));
plt.subplot(2, 2, 4)
#dfc_new['ListingCategory'].value_counts()
grid = sns.scatterplot(data=dfc_new[dfc_new['BorrowerRate'].notna() & (dfc_new['LoanStatus'] != 'Current')], x='BorrowerRate', y='EstimatedReturn_jitter', hue = 'Term', palette=palette)
grid.tick_params(axis='x', labelrotation=90)
grid.set_title('BorrowerAPR vs EstimatedEffectiveYield by LoanStatus - None Current loans - post Jul 2009')
grid.set(xlabel='BorrowerRate', ylabel='EstimatedReturn');
sns.move_legend(grid, "upper left", bbox_to_anchor=(1, 1));
# The relahionship between borrowerRate and EstimateReturn, EstimatedEffectiveYield, and EstimatedLoss shows groups of data points
# What are those groups of data related to?
# lets try to segement the data by ProsperScore to see if also has an impact on how those bands are classified.
# The pattern observed with ProsperRating (Alpha) is repeated here, where "generally speaking" the lowest end of the scale dominates
# one end of the plot, and gradually increasing until reaching the highest end of the scale at the other side of the plot.
# Also, (admittedly it is a bit hard to see!) ignroing the lowest level that takes the blue band by itself, we can see that each two adjacent levels of the ProsperScore
# dominate one band in the plot generally speaking.
# let's try a divergent palette in the next plot.
plt.figure(figsize = [25, 15])
plt.subplots_adjust(top = 0.8, bottom=0.01, hspace=0.3, wspace=0.2) #control gap between subplots
# Use white grid plot background from seaborn
sns.set(font_scale=1.2, style="ticks")
plt.subplot(2, 2, 1)
grid = sns.scatterplot(data=dfc_new[dfc_new['BorrowerRate'].notna()], x='BorrowerRate', y='LenderYield', hue = 'ProsperScore', palette="tab20")
grid.tick_params(axis='x', labelrotation=90)
grid.set_title('BorrowerRate vs LenderYield - post Jul 2009')
grid.set(xlabel='BorrowerRate', ylabel='LenderYield');
# Shrink current axis's height by 10% on the bottom
box = grid.get_position()
grid.set_position([box.x0, box.y0 + box.height * 0.1,
box.width, box.height * 0.9])
# Put a legend below current axis
grid.legend(loc='upper center', bbox_to_anchor=(0.5, -0.15),
fancybox=True, shadow=True, ncol=7, title='ProsperScore')
plt.subplot(2, 2, 2)
grid = sns.scatterplot(data=dfc_new[dfc_new['BorrowerRate'].notna()], x='BorrowerRate', y='EstimatedLoss', hue = 'ProsperScore', palette="tab20")
grid.tick_params(axis='x', labelrotation=90)
grid.set_title('BorrowerRate vs EstimatedLoss - post Jul 2009')
grid.set(xlabel='BorrowerRate', ylabel='EstimatedLoss');
# Shrink current axis's height by 10% on the bottom
box = grid.get_position()
grid.set_position([box.x0, box.y0 + box.height * 0.1,
box.width, box.height * 0.9])
# Put a legend below current axis
grid.legend(loc='upper center', bbox_to_anchor=(0.5, -0.15),
fancybox=True, shadow=True, ncol=7, title='ProsperScore')
plt.subplot(2, 2, 3)
grid = sns.scatterplot(data=dfc_new[dfc_new['BorrowerRate'].notna()], x='BorrowerRate', y='EstimatedEffectiveYield', hue = 'ProsperScore', palette="tab20")
grid.tick_params(axis='x', labelrotation=90)
grid.set_title('BorrowerRate vs EstimatedEffectiveYield - post Jul 2009')
grid.set(xlabel='BorrowerRate', ylabel='EstimatedEffectiveYield');
box = grid.get_position()
grid.set_position([box.x0, box.y0 + box.height * 0.1,
box.width, box.height * 0.9])
# Put a legend below current axis
grid.legend(loc='upper center', bbox_to_anchor=(0.5, -0.15),
fancybox=True, shadow=True, ncol=7, title='ProsperScore')
plt.subplot(2, 2, 4)
grid = sns.scatterplot(data=dfc_new[dfc_new['BorrowerRate'].notna() & dfc_new['EstimatedReturn'].notna()], x='BorrowerRate', y='EstimatedReturn', hue = 'ProsperScore', palette="tab20")
grid.tick_params(axis='x', labelrotation=90)
grid.set_title('BorrowerRate vs EstimatedReturn - post Jul 2009')
grid.set(xlabel='BorrowerRate', ylabel='EstimatedReturn');
box = grid.get_position()
grid.set_position([box.x0, box.y0 + box.height * 0.1,
box.width, box.height * 0.9])
# Put a legend below current axis
grid.legend(loc='upper center', bbox_to_anchor=(0.5, -0.15),
fancybox=True, shadow=True, ncol=7, title='ProsperScore');
# The relahionship between borrowerRate and EstimateReturn, EstimatedEffectiveYield, and EstimatedLoss shows groups of data points
# What are those groups of data related to?
# lets try to segement the data by ProsperScore to see if also has an impact on how those bands are classified.
# trying a divergent palette in the plot from the preivous cell, we can see the horizontal bands moving from blue at the bottom gradually to red at the top in
# accordance with the scale from highest to lowest.
# The hard to see diagonal swarm of data in the middle seems to follow a similar pattern where the bottom part is in blue corresponding
# to the highest level of the scale and then gradually moving upwards towards the lowest level in the scale represented in red.
plt.figure(figsize = [25, 15])
plt.subplots_adjust(top = 0.8, bottom=0.01, hspace=0.3, wspace=0.2) #control gap between subplots
palette = sns.color_palette("coolwarm_r", 11)
# Use white grid plot background from seaborn
sns.set(font_scale=1.2, style="ticks")
plt.subplot(2, 2, 1)
grid = sns.scatterplot(data=dfc_new[dfc_new['BorrowerRate'].notna()], x='BorrowerRate', y='LenderYield', hue = 'ProsperScore', palette=palette)
grid.tick_params(axis='x', labelrotation=90)
grid.set_title('BorrowerRate vs LenderYield - post Jul 2009')
grid.set(xlabel='BorrowerRate', ylabel='LenderYield');
# Shrink current axis's height by 10% on the bottom
box = grid.get_position()
grid.set_position([box.x0, box.y0 + box.height * 0.1,
box.width, box.height * 0.9])
# Put a legend below current axis
grid.legend(loc='upper center', bbox_to_anchor=(0.5, -0.15),
fancybox=True, shadow=True, ncol=7, title='ProsperScore')
plt.subplot(2, 2, 2)
grid = sns.scatterplot(data=dfc_new[dfc_new['BorrowerRate'].notna()], x='BorrowerRate', y='EstimatedLoss', hue = 'ProsperScore', palette=palette)
grid.tick_params(axis='x', labelrotation=90)
grid.set_title('BorrowerRate vs EstimatedLoss - post Jul 2009')
grid.set(xlabel='BorrowerRate', ylabel='EstimatedLoss');
# Shrink current axis's height by 10% on the bottom
box = grid.get_position()
grid.set_position([box.x0, box.y0 + box.height * 0.1,
box.width, box.height * 0.9])
# Put a legend below current axis
grid.legend(loc='upper center', bbox_to_anchor=(0.5, -0.15),
fancybox=True, shadow=True, ncol=7, title='ProsperScore')
plt.subplot(2, 2, 3)
grid = sns.scatterplot(data=dfc_new[dfc_new['BorrowerRate'].notna()], x='BorrowerRate', y='EstimatedEffectiveYield', hue = 'ProsperScore', palette=palette)
grid.tick_params(axis='x', labelrotation=90)
grid.set_title('BorrowerRate vs EstimatedEffectiveYield - post Jul 2009')
grid.set(xlabel='BorrowerRate', ylabel='EstimatedEffectiveYield');
box = grid.get_position()
grid.set_position([box.x0, box.y0 + box.height * 0.1,
box.width, box.height * 0.9])
# Put a legend below current axis
grid.legend(loc='upper center', bbox_to_anchor=(0.5, -0.15),
fancybox=True, shadow=True, ncol=7, title='ProsperScore')
plt.subplot(2, 2, 4)
grid = sns.scatterplot(data=dfc_new[dfc_new['BorrowerRate'].notna() & dfc_new['EstimatedReturn'].notna()], x='BorrowerRate', y='EstimatedReturn', hue = 'ProsperScore', palette=palette)
grid.tick_params(axis='x', labelrotation=90)
grid.set_title('BorrowerRate vs EstimatedReturn - post Jul 2009')
grid.set(xlabel='BorrowerRate', ylabel='EstimatedReturn');
box = grid.get_position()
grid.set_position([box.x0, box.y0 + box.height * 0.1,
box.width, box.height * 0.9])
# Put a legend below current axis
grid.legend(loc='upper center', bbox_to_anchor=(0.5, -0.15),
fancybox=True, shadow=True, ncol=7, title='ProsperScore');
# The relahionship between borrowerRate and EstimateReturn, EstimatedEffectiveYield, and EstimatedLoss shows groups of data points
# What are those groups of data related to?
# lets try to segement the data by LoanStatus (Current vs other statuses) to see if they help explain the diagonal vs horizontal bands of data
# None-Current loans still exhibit the horizontal bands of data and lower density diagonal middle swarm.
plt.figure(figsize = [25, 15])
plt.subplots_adjust(top = 0.8, bottom=0.01, hspace=0.3, wspace=0.2) #control gap between subplots
palette = sns.color_palette("coolwarm_r", 11)
# Use white grid plot background from seaborn
sns.set(font_scale=1.2, style="ticks")
plt.subplot(2, 2, 1)
grid = sns.scatterplot(data=dfc_new[dfc_new['BorrowerRate'].notna() & (dfc_new['LoanStatus'] != 'Current')], x='BorrowerRate', y='LenderYield', hue = 'ProsperScore', palette=palette)
grid.tick_params(axis='x', labelrotation=90)
grid.set_title('BorrowerRate vs LenderYield - None Current - post Jul 2009')
grid.set(xlabel='BorrowerRate', ylabel='LenderYield');
# Shrink current axis's height by 10% on the bottom
box = grid.get_position()
grid.set_position([box.x0, box.y0 + box.height * 0.1,
box.width, box.height * 0.9])
# Put a legend below current axis
grid.legend(loc='upper center', bbox_to_anchor=(0.5, -0.15),
fancybox=True, shadow=True, ncol=7, title='ProsperScore')
plt.subplot(2, 2, 2)
grid = sns.scatterplot(data=dfc_new[dfc_new['BorrowerRate'].notna() & (dfc_new['LoanStatus'] != 'Current')], x='BorrowerRate', y='EstimatedLoss', hue = 'ProsperScore', palette=palette)
grid.tick_params(axis='x', labelrotation=90)
grid.set_title('BorrowerRate vs EstimatedLoss - None Current - post Jul 2009')
grid.set(xlabel='BorrowerRate', ylabel='EstimatedLoss');
# Shrink current axis's height by 10% on the bottom
box = grid.get_position()
grid.set_position([box.x0, box.y0 + box.height * 0.1,
box.width, box.height * 0.9])
# Put a legend below current axis
grid.legend(loc='upper center', bbox_to_anchor=(0.5, -0.15),
fancybox=True, shadow=True, ncol=7, title='ProsperScore')
plt.subplot(2, 2, 3)
grid = sns.scatterplot(data=dfc_new[dfc_new['BorrowerRate'].notna() & (dfc_new['LoanStatus'] != 'Current')], x='BorrowerRate', y='EstimatedEffectiveYield', hue = 'ProsperScore', palette=palette)
grid.tick_params(axis='x', labelrotation=90)
grid.set_title('BorrowerRate vs EstimatedEffectiveYield - None Current - post Jul 2009')
grid.set(xlabel='BorrowerRate', ylabel='EstimatedEffectiveYield');
box = grid.get_position()
grid.set_position([box.x0, box.y0 + box.height * 0.1,
box.width, box.height * 0.9])
# Put a legend below current axis
grid.legend(loc='upper center', bbox_to_anchor=(0.5, -0.15),
fancybox=True, shadow=True, ncol=7, title='ProsperScore')
plt.subplot(2, 2, 4)
grid = sns.scatterplot(data=dfc_new[dfc_new['BorrowerRate'].notna() & dfc_new['EstimatedReturn'].notna() & (dfc_new['LoanStatus'] != 'Current')], x='BorrowerRate', y='EstimatedReturn', hue = 'ProsperScore', palette=palette)
grid.tick_params(axis='x', labelrotation=90)
grid.set_title('BorrowerRate vs EstimatedReturn - None Current - post Jul 2009')
grid.set(xlabel='BorrowerRate', ylabel='EstimatedReturn');
box = grid.get_position()
grid.set_position([box.x0, box.y0 + box.height * 0.1,
box.width, box.height * 0.9])
# Put a legend below current axis
grid.legend(loc='upper center', bbox_to_anchor=(0.5, -0.15),
fancybox=True, shadow=True, ncol=7, title='ProsperScore');
# The relahionship between borrowerRate and EstimateReturn, EstimatedEffectiveYield, and EstimatedLoss shows groups of data points
# What are those groups of data related to?
# lets try to segement the data by LoanStatus (Current vs other statuses) to see if they help explain the diagonal vs horizontal bands of data
# Interestingly the horizontal bands of data disappeared for Current LoanStatus while the diagonal middle swarm is the only relationship present,
# possibly suggesting differences in these metrics occured over time!
m1 = ['_','1','1','+','+','3','3','x','x',5,5]
plt.figure(figsize = [25, 15])
plt.subplots_adjust(top = 0.8, bottom=0.01, hspace=0.3, wspace=0.2) #control gap between subplots
palette = sns.color_palette("coolwarm_r", 11)
# Use white grid plot background from seaborn
sns.set(font_scale=1.2, style="ticks")
plt.subplot(2, 2, 1)
grid = sns.scatterplot(data=dfc_new[dfc_new['BorrowerRate'].notna() & (dfc_new['LoanStatus'] == 'Current')], x='BorrowerRate', y='LenderYield', hue = 'ProsperScore', palette=palette)#, markers=m1, style='ProsperScore'#, kwrags={edgecolor:'w'}
grid.tick_params(axis='x', labelrotation=90)
grid.set_title('BorrowerRate vs LenderYield - Current loans - post Jul 2009')
grid.set(xlabel='BorrowerRate', ylabel='LenderYield');
# Shrink current axis's height by 10% on the bottom
box = grid.get_position()
grid.set_position([box.x0, box.y0 + box.height * 0.1,
box.width, box.height * 0.9])
# Put a legend below current axis
grid.legend(loc='upper center', bbox_to_anchor=(0.5, -0.15),
fancybox=True, shadow=True, ncol=8, title='ProsperRating (Alpha)')
plt.subplot(2, 2, 2)
grid = sns.scatterplot(data=dfc_new[dfc_new['BorrowerRate'].notna() & (dfc_new['LoanStatus'] == 'Current')], x='BorrowerRate', y='EstimatedLoss', hue = 'ProsperScore', palette=palette)#, markers=m1, style='ProsperScore'#, kwrags={edgecolor:'w'}
grid.tick_params(axis='x', labelrotation=90)
grid.set_title('BorrowerRate vs EstimatedLoss - Current loans - post Jul 2009')
grid.set(xlabel='BorrowerRate', ylabel='EstimatedLoss');
# Shrink current axis's height by 10% on the bottom
box = grid.get_position()
grid.set_position([box.x0, box.y0 + box.height * 0.1,
box.width, box.height * 0.9])
# Put a legend below current axis
grid.legend(loc='upper center', bbox_to_anchor=(0.5, -0.15),
fancybox=True, shadow=True, ncol=8, title='ProsperRating (Alpha)')
plt.subplot(2, 2, 3)
grid = sns.scatterplot(data=dfc_new[dfc_new['BorrowerRate'].notna() & (dfc_new['LoanStatus'] == 'Current')], x='BorrowerRate', y='EstimatedEffectiveYield', hue = 'ProsperScore', palette=palette)#, markers=m1, style='ProsperScore' # ,x_jitter=4, y_jitter=4 dont work
grid.tick_params(axis='x', labelrotation=90)
grid.set_title('BorrowerRate vs EstimatedEffectiveYield - Current loans - post Jul 2009')
grid.set(xlabel='BorrowerRate', ylabel='EstimatedEffectiveYield');
box = grid.get_position()
grid.set_position([box.x0, box.y0 + box.height * 0.1,
box.width, box.height * 0.9])
# Put a legend below current axis
grid.legend(loc='upper center', bbox_to_anchor=(0.5, -0.15),
fancybox=True, shadow=True, ncol=8, title='ProsperRating (Alpha)')
plt.subplot(2, 2, 4)
grid = sns.scatterplot(data=dfc_new[dfc_new['BorrowerRate'].notna() & dfc_new['EstimatedReturn'].notna() & (dfc_new['LoanStatus'] == 'Current')], x='BorrowerRate', y='EstimatedReturn', hue = 'ProsperScore', palette=palette)#, markers=m1, style='ProsperScore'
grid.tick_params(axis='x', labelrotation=90)
grid.set_title('BorrowerRate vs EstimatedReturn - Current loans - post Jul 2009')
grid.set(xlabel='BorrowerRate', ylabel='EstimatedReturn');
box = grid.get_position()
grid.set_position([box.x0, box.y0 + box.height * 0.1,
box.width, box.height * 0.9])
# Put a legend below current axis
grid.legend(loc='upper center', bbox_to_anchor=(0.5, -0.15),
fancybox=True, shadow=True, ncol=8, title='ProsperScore'); # increase ncol by 1 for title to show
plt.show();
# The relahionship between borrowerRate and EstimateReturn, EstimatedEffectiveYield, and EstimatedLoss shows groups of data points
# What are those groups of data related to?
# let's add an fourth variable in the mix, Term, for the above plots to see if it helps explain some of the relationships we see.
# There was no need to add the markers as an extra fourth variable, because it added clutter that is hard to decode.
m1 = ['_','1','1','+','+','3','3','x','x',5,5]
plt.figure(figsize = [25, 15])
plt.subplots_adjust(top = 0.8, bottom=0.01, hspace=0.3, wspace=0.2) #control gap between subplots
palette = sns.color_palette("coolwarm_r", 11)
# Use white grid plot background from seaborn
sns.set(font_scale=1.2, style="ticks")
plt.subplot(2, 2, 1)
grid = sns.scatterplot(data=dfc_new[dfc_new['BorrowerRate'].notna() & (dfc_new['LoanStatus'] != 'Current')], x='BorrowerRate', y='LenderYield', hue = 'ProsperScore', style='Term', palette=palette)#, markers=m1, style='ProsperScore'#, kwrags={edgecolor:'w'}
grid.tick_params(axis='x', labelrotation=90)
grid.set_title('BorrowerRate vs LenderYield - None Current - post Jul 2009')
grid.set(xlabel='BorrowerRate', ylabel='LenderYield');
# Shrink current axis's height by 10% on the bottom
box = grid.get_position()
grid.set_position([box.x0, box.y0 + box.height * 0.1,
box.width, box.height * 0.9])
# Put a legend below current axis
grid.legend(loc='upper center', bbox_to_anchor=(0.5, -0.15),
fancybox=True, shadow=True, ncol=8, title='ProsperRating (Alpha)')
plt.subplot(2, 2, 2)
grid = sns.scatterplot(data=dfc_new[dfc_new['BorrowerRate'].notna() & (dfc_new['LoanStatus'] != 'Current')], x='BorrowerRate', y='EstimatedLoss', hue = 'ProsperScore', style='Term', palette=palette)#, markers=m1, style='ProsperScore'#, kwrags={edgecolor:'w'}
grid.tick_params(axis='x', labelrotation=90)
grid.set_title('BorrowerRate vs EstimatedLoss - None Current - post Jul 2009')
grid.set(xlabel='BorrowerRate', ylabel='EstimatedLoss');
# Shrink current axis's height by 10% on the bottom
box = grid.get_position()
grid.set_position([box.x0, box.y0 + box.height * 0.1,
box.width, box.height * 0.9])
# Put a legend below current axis
grid.legend(loc='upper center', bbox_to_anchor=(0.5, -0.15),
fancybox=True, shadow=True, ncol=8, title='ProsperRating (Alpha)')
plt.subplot(2, 2, 3)
grid = sns.scatterplot(data=dfc_new[dfc_new['BorrowerRate'].notna() & (dfc_new['LoanStatus'] != 'Current')], x='BorrowerRate', y='EstimatedEffectiveYield', hue = 'ProsperScore', style='Term', palette=palette)#, markers=m1, style='ProsperScore' # ,x_jitter=4, y_jitter=4 dont work
grid.tick_params(axis='x', labelrotation=90)
grid.set_title('BorrowerRate vs EstimatedEffectiveYield - None Current - post Jul 2009')
grid.set(xlabel='BorrowerRate', ylabel='EstimatedEffectiveYield');
box = grid.get_position()
grid.set_position([box.x0, box.y0 + box.height * 0.1,
box.width, box.height * 0.9])
# Put a legend below current axis
grid.legend(loc='upper center', bbox_to_anchor=(0.5, -0.15),
fancybox=True, shadow=True, ncol=8, title='ProsperRating (Alpha)')
plt.subplot(2, 2, 4)
grid = sns.scatterplot(data=dfc_new[dfc_new['BorrowerRate'].notna() & dfc_new['EstimatedReturn'].notna() & (dfc_new['LoanStatus'] != 'Current')], x='BorrowerRate', y='EstimatedReturn', hue = 'ProsperScore', style='Term', palette=palette)#, markers=m1, style='ProsperScore'
grid.tick_params(axis='x', labelrotation=90)
grid.set_title('BorrowerRate vs EstimatedReturn - None Current - post Jul 2009')
grid.set(xlabel='BorrowerRate', ylabel='EstimatedReturn');
box = grid.get_position()
grid.set_position([box.x0, box.y0 + box.height * 0.1,
box.width, box.height * 0.9])
# Put a legend below current axis
grid.legend(loc='upper center', bbox_to_anchor=(0.5, -0.15),
fancybox=True, shadow=True, ncol=8, title='ProsperScore'); # increase ncol by 1 for title to show
plt.show();
# The relahionship between borrowerRate and EstimateReturn, EstimatedEffectiveYield, and EstimatedLoss shows groups of data points
# What are those groups of data related to?
# let's add an fourth variable in the mix, Term, for the above plots to see if it helps explain some of the relationships we see.
# There was no need to add the markers as an extra fourth variable, because it added clutter that is hard to decode.
m1 = ['_','1','1','+','+','3','3','x','x',5,5]
plt.figure(figsize = [25, 15])
plt.subplots_adjust(top = 0.8, bottom=0.01, hspace=0.3, wspace=0.2) #control gap between subplots
palette = sns.color_palette("coolwarm_r", 11)
plt.subplot(2, 2, 1)
grid = sns.scatterplot(data=dfc_new[dfc_new['BorrowerRate'].notna() & (dfc_new['LoanStatus'] == 'Current')], x='BorrowerRate', y='LenderYield_jitter', hue = 'ProsperScore', style='Term', palette=palette)#, markers=m1, style='ProsperScore'#, kwrags={edgecolor:'w'}
grid.tick_params(axis='x', labelrotation=90)
grid.set_title('BorrowerRate vs LenderYield - Current - post Jul 2009')
grid.set(xlabel='BorrowerRate', ylabel='LenderYield');
# Shrink current axis's height by 10% on the bottom
box = grid.get_position()
grid.set_position([box.x0, box.y0 + box.height * 0.1,
box.width, box.height * 0.9])
# Put a legend below current axis
grid.legend(loc='upper center', bbox_to_anchor=(0.5, -0.15),
fancybox=True, shadow=True, ncol=8, title='ProsperRating (Alpha)')
plt.subplot(2, 2, 2)
grid = sns.scatterplot(data=dfc_new[dfc_new['BorrowerRate'].notna() & (dfc_new['LoanStatus'] == 'Current')], x='BorrowerRate', y='EstimatedLoss', hue = 'ProsperScore', style='Term', palette=palette)#, markers=m1, style='ProsperScore'#, kwrags={edgecolor:'w'}
grid.tick_params(axis='x', labelrotation=90)
grid.set_title('BorrowerRate vs EstimatedLoss - Current - post Jul 2009')
grid.set(xlabel='BorrowerRate', ylabel='EstimatedLoss');
# Shrink current axis's height by 10% on the bottom
box = grid.get_position()
grid.set_position([box.x0, box.y0 + box.height * 0.1,
box.width, box.height * 0.9])
# Put a legend below current axis
grid.legend(loc='upper center', bbox_to_anchor=(0.5, -0.15),
fancybox=True, shadow=True, ncol=8, title='ProsperRating (Alpha)')
plt.subplot(2, 2, 3)
grid = sns.scatterplot(data=dfc_new[dfc_new['BorrowerRate'].notna() & (dfc_new['LoanStatus'] == 'Current')], x='BorrowerRate', y='EstimatedEffectiveYield', hue = 'ProsperScore', style='Term', palette=palette)#, markers=m1, style='ProsperScore' # ,x_jitter=4, y_jitter=4 dont work
grid.tick_params(axis='x', labelrotation=90)
grid.set_title('BorrowerRate vs EstimatedEffectiveYield - Current - post Jul 2009')
grid.set(xlabel='BorrowerRate', ylabel='EstimatedEffectiveYield');
box = grid.get_position()
grid.set_position([box.x0, box.y0 + box.height * 0.1,
box.width, box.height * 0.9])
# Put a legend below current axis
grid.legend(loc='upper center', bbox_to_anchor=(0.5, -0.15),
fancybox=True, shadow=True, ncol=8, title='ProsperRating (Alpha)')
plt.subplot(2, 2, 4)
grid = sns.scatterplot(data=dfc_new[dfc_new['BorrowerRate'].notna() & dfc_new['EstimatedReturn'].notna() & (dfc_new['LoanStatus'] == 'Current')], x='BorrowerRate', y='EstimatedReturn', hue = 'ProsperScore', style='Term', palette=palette)#, markers=m1, style='ProsperScore'
grid.tick_params(axis='x', labelrotation=90)
grid.set_title('BorrowerRate vs EstimatedReturn - Current - post Jul 2009')
grid.set(xlabel='BorrowerRate', ylabel='EstimatedReturn');
box = grid.get_position()
grid.set_position([box.x0, box.y0 + box.height * 0.1,
box.width, box.height * 0.9])
# Put a legend below current axis
grid.legend(loc='upper center', bbox_to_anchor=(0.5, -0.15),
fancybox=True, shadow=True, ncol=8, title='ProsperScore'); # increase ncol by 1 for title to show
plt.show();
# how does the post Jul 2009 CreditScore look like and which has higher representation of borrowers?
# CreditScoreRangeLower and CreditScoreRangeUpper have a perfect linear relationship. No variations detected for those levels.
# Therefore in analysis we can use one of them to represent this scoring metric.
# worth noting that most common loans are given to mid lower Scores as can be seen by the circule size.
#reset theme
sns.set_theme()
dfcounted = dfc_new.groupby(['CreditScoreRangeLower', 'CreditScoreRangeUpper']).size().reset_index(name='Count')
grid = sns.scatterplot(data=dfcounted[dfcounted['CreditScoreRangeLower'].notna()], x='CreditScoreRangeLower', y='CreditScoreRangeUpper', size='Count', color='deepskyblue')
grid.tick_params(axis='x', labelrotation=90)
grid.set_title('Scatterplot of CreditScore Lower vs Upper ranges - Post Jul 2009')
grid.set(xlabel='Lower Credit Range', ylabel='Upper Credit Range');
# The relahionship between borrowerRate and EstimateReturn, EstimatedEffectiveYield, and EstimatedLoss shows groups of data points
# What are those groups of data related to?
# lets try to segement the data by CreditScoreRangeLower to see if also has an impact on how those bands are classified.
# There is no clear trend easy to splot of how those CreditScoreRangeLower are mapped in the plot probably due to the large number of levels present in this variables,
# but very careful observation of the plot reveals overall trend of lower CreditScoreRangeLower dominating one side (upper) of the scatterplot while the upper
# CreditScoreRangeLower values dominating the other side (lower) of the scatterplot.
# There is a very difficult to spot hint in the plot, suggesting CreditScoreRangeLower is influencing the sub lines within the horizontal sub bands
# (very closely adjacent horizontal lines)
# The overall trend may be better represented by choosing a diverging pallete.
plt.figure(figsize = [25, 15])
plt.subplots_adjust(top = 0.8, bottom=0.01, hspace=0.3, wspace=0.2) #control gap between subplots
# Use white grid plot background from seaborn
sns.set(font_scale=1.2, style="ticks")
plt.subplot(2, 2, 1)
grid = sns.scatterplot(data=dfc_new[dfc_new['BorrowerRate'].notna()], x='BorrowerRate', y='LenderYield_jitter', hue = 'CreditScoreRangeLower', palette="tab20")
grid.tick_params(axis='x', labelrotation=90)
grid.set_title('BorrowerRate vs LenderYield - post Jul 2009')
grid.set(xlabel='BorrowerRate', ylabel='LenderYield');
# Shrink current axis's height by 10% on the bottom
box = grid.get_position()
grid.set_position([box.x0, box.y0 + box.height * 0.1,
box.width, box.height * 0.9])
# Put a legend below current axis
grid.legend(loc='upper center', bbox_to_anchor=(0.5, -0.15),
fancybox=True, shadow=True, ncol=7, title='CreditScoreRangeLower')
plt.subplot(2, 2, 2)
grid = sns.scatterplot(data=dfc_new[dfc_new['BorrowerRate'].notna()], x='BorrowerRate', y='EstimatedLoss', hue = 'CreditScoreRangeLower', palette="tab20")
grid.tick_params(axis='x', labelrotation=90)
grid.set_title('BorrowerRate vs EstimatedLoss - post Jul 2009')
grid.set(xlabel='BorrowerRate', ylabel='EstimatedLoss');
# Shrink current axis's height by 10% on the bottom
box = grid.get_position()
grid.set_position([box.x0, box.y0 + box.height * 0.1,
box.width, box.height * 0.9])
# Put a legend below current axis
grid.legend(loc='upper center', bbox_to_anchor=(0.5, -0.15),
fancybox=True, shadow=True, ncol=7, title='CreditScoreRangeLower')
plt.subplot(2, 2, 3)
grid = sns.scatterplot(data=dfc_new[dfc_new['BorrowerRate'].notna()], x='BorrowerRate', y='EstimatedEffectiveYield', hue = 'CreditScoreRangeLower', palette="tab20")
grid.tick_params(axis='x', labelrotation=90)
grid.set_title('BorrowerRate vs EstimatedEffectiveYield - post Jul 2009')
grid.set(xlabel='BorrowerRate', ylabel='EstimatedEffectiveYield');
box = grid.get_position()
grid.set_position([box.x0, box.y0 + box.height * 0.1,
box.width, box.height * 0.9])
# Put a legend below current axis
grid.legend(loc='upper center', bbox_to_anchor=(0.5, -0.15),
fancybox=True, shadow=True, ncol=7, title='CreditScoreRangeLower')
plt.subplot(2, 2, 4)
grid = sns.scatterplot(data=dfc_new[dfc_new['BorrowerRate'].notna() & dfc_new['EstimatedReturn'].notna()], x='BorrowerRate', y='EstimatedReturn', hue = 'CreditScoreRangeLower', palette="tab20")
grid.tick_params(axis='x', labelrotation=90)
grid.set_title('BorrowerRate vs EstimatedReturn - post Jul 2009')
grid.set(xlabel='BorrowerRate', ylabel='EstimatedReturn');
box = grid.get_position()
grid.set_position([box.x0, box.y0 + box.height * 0.1,
box.width, box.height * 0.9])
# Put a legend below current axis
grid.legend(loc='upper center', bbox_to_anchor=(0.5, -0.15),
fancybox=True, shadow=True, ncol=7, title='CreditScoreRangeLower');
# The relahionship between borrowerRate and EstimateReturn, EstimatedEffectiveYield, and EstimatedLoss shows groups of data points
# What are those groups of data related to?
# lets try to segement the data by CreditScoreRangeLower to see if also has an impact on how those bands are classified using a diverging palette.
# Choosing a divering palette shows the gradual transitions in the scatterplot colours marrying up with the levels present in CreditScoreRangeLower
# for both the horizontal bands and the diagonal swarm of data.
plt.figure(figsize = [25, 15])
plt.subplots_adjust(top = 0.8, bottom=0.01, hspace=0.3, wspace=0.2) #control gap between subplots
# Use white grid plot background from seaborn
sns.set(font_scale=1.2, style="ticks")
plt.subplot(2, 2, 1)
grid = sns.scatterplot(data=dfc_new[dfc_new['BorrowerRate'].notna()], x='BorrowerRate', y='LenderYield', hue = 'CreditScoreRangeLower', palette="vlag_r")
grid.tick_params(axis='x', labelrotation=90)
grid.set_title('BorrowerRate vs LenderYield - post Jul 2009')
grid.set(xlabel='BorrowerRate', ylabel='LenderYield');
# Shrink current axis's height by 10% on the bottom
box = grid.get_position()
grid.set_position([box.x0, box.y0 + box.height * 0.1,
box.width, box.height * 0.9])
# Put a legend below current axis
grid.legend(loc='upper center', bbox_to_anchor=(0.5, -0.15),
fancybox=True, shadow=True, ncol=7, title='CreditScoreRangeLower')
plt.subplot(2, 2, 2)
grid = sns.scatterplot(data=dfc_new[dfc_new['BorrowerRate'].notna()], x='BorrowerRate', y='EstimatedLoss', hue = 'CreditScoreRangeLower', palette="vlag_r")
grid.tick_params(axis='x', labelrotation=90)
grid.set_title('BorrowerRate vs EstimatedLoss - post Jul 2009')
grid.set(xlabel='BorrowerRate', ylabel='EstimatedLoss');
# Shrink current axis's height by 10% on the bottom
box = grid.get_position()
grid.set_position([box.x0, box.y0 + box.height * 0.1,
box.width, box.height * 0.9])
# Put a legend below current axis
grid.legend(loc='upper center', bbox_to_anchor=(0.5, -0.15),
fancybox=True, shadow=True, ncol=7, title='CreditScoreRangeLower')
plt.subplot(2, 2, 3)
grid = sns.scatterplot(data=dfc_new[dfc_new['BorrowerRate'].notna()], x='BorrowerRate', y='EstimatedEffectiveYield', hue = 'CreditScoreRangeLower', palette="vlag_r")
grid.tick_params(axis='x', labelrotation=90)
grid.set_title('BorrowerRate vs EstimatedEffectiveYield - post Jul 2009')
grid.set(xlabel='BorrowerRate', ylabel='EstimatedEffectiveYield');
box = grid.get_position()
grid.set_position([box.x0, box.y0 + box.height * 0.1,
box.width, box.height * 0.9])
# Put a legend below current axis
grid.legend(loc='upper center', bbox_to_anchor=(0.5, -0.15),
fancybox=True, shadow=True, ncol=7, title='CreditScoreRangeLower')
plt.subplot(2, 2, 4)
grid = sns.scatterplot(data=dfc_new[dfc_new['BorrowerRate'].notna() & dfc_new['EstimatedReturn'].notna()], x='BorrowerRate', y='EstimatedReturn', hue = 'CreditScoreRangeLower', palette="vlag_r")
grid.tick_params(axis='x', labelrotation=90)
grid.set_title('BorrowerRate vs EstimatedReturn - post Jul 2009')
grid.set(xlabel='BorrowerRate', ylabel='EstimatedReturn');
box = grid.get_position()
grid.set_position([box.x0, box.y0 + box.height * 0.1,
box.width, box.height * 0.9])
# Put a legend below current axis
grid.legend(loc='upper center', bbox_to_anchor=(0.5, -0.15),
fancybox=True, shadow=True, ncol=7, title='CreditScoreRangeLower');
# The relahionship between borrowerRate and EstimateReturn, EstimatedEffectiveYield, and EstimatedLoss shows groups of data points
# What are those groups of data related to?
# lets Plot the relahionship between borrowerRate and EstimateReturn, EstimatedEffectiveYield, and EstimatedLoss for LoanStatus = 'Chargedoff'
# Contrary to our expectation, Chargedoff seems to have loans from across the spectrum of ProsperRating (Alpha) and not just High Risk (HR) loans!
from matplotlib import cm
import random
palette = sns.color_palette("Set2", 7)
#palette = cm.rainbow(np.linspace(0, 1, 13))
# Use white grid plot background from seaborn
sns.set(font_scale=1.2, style="ticks")
plt.figure(figsize = [25, 10])
plt.subplots_adjust(top = 0.8, bottom=0.01, hspace=0.4, wspace=0.4) #control gap between subplots
plt.subplot(2, 2, 1)
grid = sns.scatterplot(data=dfc_new[dfc_new['BorrowerRate'].notna() & dfc_new['LoanStatus'].isin(['Chargedoff'])], x='BorrowerRate', y='LenderYield_jitter', hue = 'ProsperRating (Alpha)', palette=palette, ) # ,x_jitter=4, y_jitter=4 dont work
grid.tick_params(axis='x', labelrotation=90)
grid.set_title('BorrowerAPR vs LenderYield by LoanStatus - Chargedoff')
grid.set(xlabel='BorrowerRate', ylabel='LenderYield');
sns.move_legend(grid, "upper left", bbox_to_anchor=(1, 1));
plt.subplot(2, 2, 2)
grid = sns.scatterplot(data=dfc_new[dfc_new['BorrowerRate'].notna() & dfc_new['LoanStatus'].isin(['Chargedoff'])], x='BorrowerRate', y='EstimatedLoss_jitter', hue = 'ProsperRating (Alpha)', palette=palette, ) # ,x_jitter=4, y_jitter=4 dont work
grid.tick_params(axis='x', labelrotation=90)
grid.set_title('BorrowerAPR vs EstimatedLoss by LoanStatus - Chargedoff')
grid.set(xlabel='BorrowerRate', ylabel='EstimatedLoss');
sns.move_legend(grid, "upper left", bbox_to_anchor=(1, 1));
plt.subplot(2, 2, 3)
grid = sns.scatterplot(data=dfc_new[dfc_new['BorrowerRate'].notna() & dfc_new['LoanStatus'].isin(['Chargedoff'])], x='BorrowerRate', y='EstimatedReturn_jitter', hue = 'ProsperRating (Alpha)', palette=palette)
grid.tick_params(axis='x', labelrotation=90)
grid.set_title('BorrowerAPR vs EstimatedReturn by LoanStatus - Chargedoff')
grid.set(xlabel='BorrowerRate', ylabel='EstimatedReturn');
sns.move_legend(grid, "upper left", bbox_to_anchor=(1, 1));
plt.subplot(2, 2, 4)
grid = sns.scatterplot(data=dfc_new[dfc_new['BorrowerRate'].notna() & dfc_new['LoanStatus'].isin(['Chargedoff'])], x='BorrowerRate', y='EstimatedEffectiveYield_jitter', hue = 'ProsperRating (Alpha)', palette=palette)
grid.tick_params(axis='x', labelrotation=90)
grid.set_title('BorrowerAPR vs EstimatedEffectiveYield by LoanStatus - Chargedoff')
grid.set(xlabel='BorrowerRate', ylabel='EstimatedEffectiveYield');
sns.move_legend(grid, "upper left", bbox_to_anchor=(1, 1));
# The relahionship between borrowerRate and EstimateReturn, EstimatedEffectiveYield, and EstimatedLoss shows groups of data points
# What are those groups of data related to?
# lets Plot the relahionship between borrowerRate and EstimateReturn, EstimatedEffectiveYield, and EstimatedLoss for LoanStatus = 'Defaulted'
# Contrary to our expectation, Defaulted seems to have loans from across the spectrum of ProsperRating (Alpha) and not just High Risk (HR) loans!
from matplotlib import cm
import random
palette = sns.color_palette("Set2", 7)
#palette = cm.rainbow(np.linspace(0, 1, 13))
# Use white grid plot background from seaborn
sns.set(font_scale=1.2, style="ticks")
plt.figure(figsize = [25, 10])
plt.subplots_adjust(top = 0.8, bottom=0.01, hspace=0.4, wspace=0.4) #control gap between subplots
plt.subplot(2, 2, 1)
#dfc_new['ListingCategory'].value_counts()
grid = sns.scatterplot(data=dfc_new[dfc_new['BorrowerRate'].notna() & dfc_new['LoanStatus'].isin(['Defaulted'])], x='BorrowerRate', y='LenderYield_jitter', hue = 'ProsperRating (Alpha)', palette=palette) # , style='LoanStatus'
grid.tick_params(axis='x', labelrotation=90)
grid.set_title('BorrowerAPR vs LenderYield by LoanStatus - Defaulted')
grid.set(xlabel='BorrowerRate', ylabel='LenderYield');
sns.move_legend(grid, "upper left", bbox_to_anchor=(1, 1));
plt.subplot(2, 2, 2)
#dfc_new['ListingCategory'].value_counts()
grid = sns.scatterplot(data=dfc_new[dfc_new['BorrowerRate'].notna() & dfc_new['LoanStatus'].isin(['Defaulted'])], x='BorrowerRate', y='EstimatedLoss_jitter', hue = 'ProsperRating (Alpha)', palette=palette) # , style='LoanStatus'
grid.tick_params(axis='x', labelrotation=90)
grid.set_title('BorrowerAPR vs EstimatedLoss by LoanStatus - Defaulted')
grid.set(xlabel='BorrowerRate', ylabel='EstimatedLoss');
sns.move_legend(grid, "upper left", bbox_to_anchor=(1, 1));
plt.subplot(2, 2, 3)
#dfc_new['ListingCategory'].value_counts()
grid = sns.scatterplot(data=dfc_new[dfc_new['BorrowerRate'].notna() & dfc_new['LoanStatus'].isin(['Defaulted'])], x='BorrowerRate', y='EstimatedReturn_jitter', hue = 'ProsperRating (Alpha)', palette=palette)
grid.tick_params(axis='x', labelrotation=90)
grid.set_title('BorrowerAPR vs EstimatedReturn by LoanStatus - Defaulted')
grid.set(xlabel='BorrowerRate', ylabel='EstimatedReturn');
sns.move_legend(grid, "upper left", bbox_to_anchor=(1, 1));
plt.subplot(2, 2, 4)
#dfc_new['ListingCategory'].value_counts()
grid = sns.scatterplot(data=dfc_new[dfc_new['BorrowerRate'].notna() & dfc_new['LoanStatus'].isin(['Defaulted'])], x='BorrowerRate', y='EstimatedEffectiveYield_jitter', hue = 'ProsperRating (Alpha)', palette=palette)
grid.tick_params(axis='x', labelrotation=90)
grid.set_title('BorrowerAPR vs EstimatedEffectiveYield by LoanStatus - Defaulted')
grid.set(xlabel='BorrowerRate', ylabel='EstimatedEffectiveYield');
sns.move_legend(grid, "upper left", bbox_to_anchor=(1, 1));
# sense checking the data and variables
dfc_new[dfc_new['LoanStatus'].isin(['Defautled','Chargedoff'])]
# sense checking the entire data columns/variables to pick next analysis route!
dfc_liner = dfc.loc[dfc['EstimatedReturn'] == dfc['EstimatedEffectiveYield']]
pd.set_option('display.max_columns', None)
display(dfc_liner)
# count loanStatus for colour palette generation.
dfc['LoanStatus'].value_counts()
dfc['year'].unique()
len(dfc['year'].unique())
# generates palette for all 10 years.
colors = sns.color_palette('hls', len(dfc['year'].unique()))
palette = {mode: color for mode, color in zip(dfc['year'].unique(), colors)}
palette
# Need to resort dfc['year'] to remove years from pre Jul 2009 data, given it does not have our metrics of interest (LenderReturn and EstimatedEffectiveYield)
# So that we avoid the palette wrong number of colours error in the next plot, which uses dfc_new dataframe for post Jul 2009 loans only.
# convert some variables into ordered categorical types so that they appear in a logical order during analysis and when plotting:
def sort_categorical_vars_post_jul2009():
ordinal_var_dict = {
'year': ['2009','2010','2011','2012','2013','2014']
#'col2': ['J', 'I', 'H', 'G', 'F', 'E', 'D'],
}
for var in ordinal_var_dict:
pd_ver = pd.__version__.split(".")
if (int(pd_ver[0]) > 0) or (int(pd_ver[1]) >= 21): # pandas v0.21 or later
ordered_var = pd.api.types.CategoricalDtype(ordered = True,
categories = ordinal_var_dict[var])
dfc_new[var] = dfc_new[var].astype(ordered_var)
else: # pandas pre-v0.21
dfc_new[var] = dfc_new[var].astype('category', ordered = True,
categories = ordinal_var_dict[var])
sort_categorical_vars_post_jul2009()
# recalcualting the year categorical order to be limited to post Jul 2009 years generates the required number of years and palatte colours
dfc_new['year'].unique()
# recalcualting the year categorical order to be limited to post Jul 2009 years generates the required number of years and palatte colours
len(dfc_new['year'].unique())
# recalcualting the year categorical order to be limited to post Jul 2009 years generates the required number of years and palatte colours
colors = sns.color_palette('hls', len(dfc_new['year'].unique()))
palette = {mode: color for mode, color in zip(dfc_new['year'].unique(), colors)}
palette
# The relahionship between borrowerRate and EstimateReturn, EstimatedEffectiveYield, and EstimatedLoss shows groups of data points
# What are those groups of data related to?
# lets try to segement the data by Year to see if also has an impact on how those bands are classified
# The plot below heighlights that the bands present in the relationship between those variables is present in the 2009 and 2010 years only.
# This may support the hypothesis that a possible change in the algorithm in how loan Yield, costs, loss, and ultimately Return is calculated
# has happened from year 2011 onwards.
# Did the company move towards a linear relationship and removed negative losses from its calculation?
# Is the algorithm vetting out more HR loans in later years compared to 2009 and 2020?
from distinctipy import distinctipy
# get distinct well differentiated colours equal to number of years.
N=len(dfc_new['year'].unique().dropna())
colours = distinctipy.get_colors(N) #in seaborn palette=colors
plt.figure(figsize = [25, 15])
plt.subplots_adjust(top = 0.8, bottom=0.01, hspace=0.3, wspace=0.2) #control gap between subplots
# Use white grid plot background from seaborn
sns.set(font_scale=1.2, style="ticks")
plt.subplot(2, 2, 1)
grid = sns.scatterplot(data=dfc_new[dfc_new['BorrowerRate'].notna()], x='BorrowerRate', y='LenderYield', hue = 'year', palette=colours)
grid.tick_params(axis='x', labelrotation=90)
grid.set_title('BorrowerRate vs LenderYield - post Jul 2009')
grid.set(xlabel='BorrowerRate', ylabel='LenderYield');
box = grid.get_position()
grid.set_position([box.x0, box.y0 + box.height * 0.1,
box.width, box.height * 0.9])
# Put a legend below current axis
grid.legend(loc='upper center', bbox_to_anchor=(0.5, -0.15),
fancybox=True, shadow=True, ncol=7, title='Year');
plt.subplot(2, 2, 2)
grid = sns.scatterplot(data=dfc_new[dfc_new['BorrowerRate'].notna()], x='BorrowerRate', y='EstimatedLoss_jitter', hue = 'year', palette=colours)
grid.tick_params(axis='x', labelrotation=90)
grid.set_title('BorrowerRate vs EstimatedLoss - post Jul 2009')
grid.set(xlabel='BorrowerRate', ylabel='EstimatedLoss');
# Shrink current axis's height by 10% on the bottom
box = grid.get_position()
grid.set_position([box.x0, box.y0 + box.height * 0.1,
box.width, box.height * 0.9])
# Put a legend below current axis
grid.legend(loc='upper center', bbox_to_anchor=(0.5, -0.15),
fancybox=True, shadow=True, ncol=7, title='Year')
plt.subplot(2, 2, 3)
grid = sns.scatterplot(data=dfc_new[dfc_new['BorrowerRate'].notna()], x='BorrowerRate', y='EstimatedEffectiveYield_jitter', hue = 'year', palette=colours)
grid.tick_params(axis='x', labelrotation=90)
grid.set_title('BorrowerRate vs EstimatedEffectiveYield - post Jul 2009')
grid.set(xlabel='BorrowerRate', ylabel='EstimatedEffectiveYield');
box = grid.get_position()
grid.set_position([box.x0, box.y0 + box.height * 0.1,
box.width, box.height * 0.9])
# Put a legend below current axis
grid.legend(loc='upper center', bbox_to_anchor=(0.5, -0.15),
fancybox=True, shadow=True, ncol=7, title='Year')
plt.subplot(2, 2, 4)
grid = sns.scatterplot(data=dfc_new[dfc_new['BorrowerRate'].notna() ], x='BorrowerRate', y='EstimatedReturn_jitter', hue = 'year', palette=colours)
grid.tick_params(axis='x', labelrotation=90)
grid.set_title('BorrowerRate vs EstimatedReturn - post Jul 2009')
grid.set(xlabel='BorrowerRate', ylabel='EstimatedReturn');
box = grid.get_position()
grid.set_position([box.x0, box.y0 + box.height * 0.1,
box.width, box.height * 0.9])
# Put a legend below current axis
grid.legend(loc='upper center', bbox_to_anchor=(0.5, -0.15),
fancybox=True, shadow=True, ncol=7, title='Year');
dfc_new['year'].unique()
# The relahionship between EstimateReturn and EstimatedEffectiveYield would be interesting to
# investigate a bit further, specially that it encapsulates substractioin of EstimatedLoss in the relationship between them.
# lets try to segement the plot by ProsperScore, ProsperRating (Alpha), Term, Year, and CreditScoreRangeLower to see if we can glean any more insights.
# The inspection of the plots below has made a few things clearer:
# Plots 4 and 5 show the linear line in the middle is for 36 months loans which were the only loan Term in 2009 and 2010 years.
# Plot 5 shows 2013/2014 metrics relationship seem to have a narrower (smaller variance) compared to 2011/2012. This needs further analysis.
# There are two distinct groups of data; one near perfect digonal line and another that another oval swarm of data points.
# The near perfect digonal line below and bands of losses in previous plots suggests that metrics in 2009 and 2010 years were calculated using algorithms that
#Â fixed losses (horizontal bands) mainly according to (ProsperRating (Alpha), ProsperScore, and CreditScoreRange Lower/Upper) which maintained a linear relationship
# between EstimatedEffectiveYield and EstimateReturn, but also resulted into negative EstimatedReturn for some of the High Risk (HR) loans.
# Wheresas the oval swarm of data points for years 2011 onwards, suggests that metrics were calculate using algorithms that
# used a more variable calculation of losses based on factors such as (ProsperRating (Alpha), ProsperScore, and CreditScoreRange Lower/Upper)
# which resulted into less linear / larger variance direct positive relationship between EstimatedEffectiveYield and EstimateReturn without negative EstimatedReturn.
from distinctipy import distinctipy
N=len(dfc_new['year'].dropna().unique())
colours_y = distinctipy.get_colors(N) #in seaborn palette=colors
N=len(dfc_new['CreditScoreRangeLower'].dropna().unique())
colours_s = distinctipy.get_colors(N)
plt.figure(figsize = [20, 10])
plt.subplots_adjust(top = 1.4, bottom=0.01, hspace=0.3, wspace=0.2) #control gap between subplots
# Use white grid plot background from seaborn
sns.set(font_scale=1.2, style="ticks")
plt.subplot(2, 3, 1)
grid = sns.scatterplot(data=dfc_new[dfc_new['BorrowerRate'].notna()], x='EstimatedEffectiveYield', y='EstimatedReturn', palette="coolwarm_r")#, style='ProsperRating (Alpha)'
grid.tick_params(axis='x', labelrotation=90)
grid.set_title('EstimatedEffectiveYield vs EstimatedReturn - post Jul 2009')
grid.set(xlabel='EstimatedEffectiveYield', ylabel='EstimatedReturn');
# Shrink current axis's height by 10% on the bottom
box = grid.get_position()
grid.set_position([box.x0, box.y0 + box.height * 0.1,
box.width, box.height * 0.9])
plt.subplot(2, 3, 2)
grid = sns.scatterplot(data=dfc_new[dfc_new['BorrowerRate'].notna()], x='EstimatedEffectiveYield_jitter', y='EstimatedReturn', hue = 'ProsperScore', palette="coolwarm_r")#, style='ProsperRating (Alpha)'
grid.tick_params(axis='x', labelrotation=90)
grid.set_title('EstimatedEffectiveYield vs EstimatedReturn - post Jul 2009')
grid.set(xlabel='EstimatedEffectiveYield', ylabel='EstimatedReturn');
# Shrink current axis's height by 10% on the bottom
box = grid.get_position()
grid.set_position([box.x0, box.y0 + box.height * 0.1,
box.width, box.height * 0.9])
grid.legend(title='ProsperScore', loc='upper center', bbox_to_anchor=(0.5, -0.15),
fancybox=True, shadow=True, ncol=4);
plt.subplot(2, 3, 3)
grid = sns.scatterplot(data=dfc_new[dfc_new['BorrowerRate'].notna()], x='EstimatedEffectiveYield_jitter', y='EstimatedReturn', hue = 'ProsperRating (Alpha)', palette="coolwarm")#, style='ProsperRating (Alpha)'
grid.tick_params(axis='x', labelrotation=90)
grid.set_title('EstimatedEffectiveYield vs EstimatedReturn - post Jul 2009')
grid.set(xlabel='EstimatedEffectiveYield', ylabel='EstimatedReturn');
# Shrink current axis's height by 10% on the bottom
box = grid.get_position()
grid.set_position([box.x0, box.y0 + box.height * 0.1,
box.width, box.height * 0.9])
grid.legend(title='ProsperRating (Alpha)', loc='upper center', bbox_to_anchor=(0.5, -0.15),
fancybox=True, shadow=True, ncol=4);
plt.subplot(2, 3, 4)
grid = sns.scatterplot(data=dfc_new[dfc_new['BorrowerRate'].notna()], x='EstimatedEffectiveYield_jitter', y='EstimatedReturn', hue = 'Term', palette="coolwarm")#, style='ProsperRating (Alpha)'
grid.tick_params(axis='x', labelrotation=90)
grid.set_title('EstimatedEffectiveYield vs EstimatedReturn - post Jul 2009')
grid.set(xlabel='EstimatedEffectiveYield', ylabel='EstimatedReturn');
# Shrink current axis's height by 10% on the bottom
box = grid.get_position()
grid.set_position([box.x0, box.y0 + box.height * 0.1,
box.width, box.height * 0.9])
grid.legend(title='Term', loc='upper center', bbox_to_anchor=(0.5, -0.15),
fancybox=True, shadow=True, ncol=7);
plt.subplot(2, 3, 5)
grid = sns.scatterplot(data=dfc_new[dfc_new['BorrowerRate'].notna()], x='EstimatedEffectiveYield_jitter', y='EstimatedReturn', hue = 'year', palette=colours_y, alpha=0.7)#, style='ProsperRating (Alpha)' # palette="deep"
grid.tick_params(axis='x', labelrotation=90)
grid.set_title('EstimatedEffectiveYield vs EstimatedReturn - post Jul 2009')
grid.set(xlabel='EstimatedEffectiveYield', ylabel='EstimatedReturn');
# Shrink current axis's height by 10% on the bottom
box = grid.get_position()
grid.set_position([box.x0, box.y0 + box.height * 0.1,
box.width, box.height * 0.9])
grid.legend(title='year', loc='upper center', bbox_to_anchor=(0.5, -0.15),
fancybox=True, shadow=True, ncol=4);
plt.subplot(2, 3, 6)
grid = sns.scatterplot(data=dfc_new[dfc_new['BorrowerRate'].notna()], x='EstimatedEffectiveYield_jitter', y='EstimatedReturn', hue = 'CreditScoreRangeLower', palette="coolwarm_r")#palette="colours_s", style='ProsperRating (Alpha)', palette="cubehelix"
grid.tick_params(axis='x', labelrotation=90)
grid.set_title('EstimatedEffectiveYield vs EstimatedReturn - post Jul 2009')
grid.set(xlabel='EstimatedEffectiveYield', ylabel='EstimatedReturn');
# Shrink current axis's height by 10% on the bottom
box = grid.get_position()
grid.set_position([box.x0, box.y0 + box.height * 0.1,
box.width, box.height * 0.9])
grid.legend(title='CreditScoreRangeLower', loc='upper center', bbox_to_anchor=(0.5, -0.15),
fancybox=True, shadow=True, ncol=3);
colours_y
# confrim 12 and 60 Temrs introduced in 2011. very small numbers in 2010, which are probably negligible
dfc_new.groupby(['year', 'Term'])['Term'].count()
# confrim 12 and 60 Temrs introduced in 2011. very small numbers in 2010, which are probably negligible
def annotate_bars(ax=None, fmt='.0f', **kwargs): #'.2f' ',f'
ax = plt.gca() if ax is None else ax
# add labels to bars
for p in ax.patches:
ax.annotate('{{:{:s}}}'.format(fmt).format(p.get_height()), (p.get_x() + p.get_width() / 2., p.get_height()),
xytext=(0, 5),textcoords='offset points',
ha='center', va='center', **kwargs)
p.set_edgecolor('white') # white is default, but can change to red to make small bars more visible!
plt.figure(figsize = [20, 10])
sns.set_theme()
sns.set(font_scale=1) # change plots individual title size.
grid = sns.FacetGrid(data=dfc_new, col='Term', col_wrap=4, margin_titles=True, height=3, aspect= 1.33)
grid.map(sns.countplot, 'year', color='deepskyblue', alpha=1.0);
grid.map(annotate_bars, fmt=',.0f', fontsize=10, color='k') #='.2g'
grid.fig.subplots_adjust(top=0.8)
grid.fig.suptitle('Count of Loans per year by Term - post Jul 2009!', fontsize=14);
grid.add_legend();
# The relahionship between borrowerRate and EstimateReturn, EstimatedEffectiveYield, and EstimatedLoss shows groups of data points
# What are those groups of data related to?
# lets try to segement the data by Periods, which may help explain the different data groupings in the plot.
# The plots shows that some of the variance observed can be explained by the Periods,
# The horziontal bands are present in the 2009-2010 period, whiles in the 2011-2014 period it takes a diagonal swarm of data.
# The 2009-2010 have fixed EstimatedLoss values causing those horizontal bands of data to appear. Whereas, 2011-2014 have a incremental EstimatedLoss causing a diagonal swarm of data
plt.figure(figsize = [25, 10])
plt.subplots_adjust(top = 0.8, bottom=0.01, hspace=0.4, wspace=0.4) #control gap between subplots
palette = sns.color_palette("deep", 2) #Paired
# Use white grid plot background from seaborn
sns.set(font_scale=1.2, style="ticks")
plt.subplot(2, 2, 1)
#dfc_new['ListingCategory'].value_counts()
grid = sns.scatterplot(data=dfc_new[dfc_new['BorrowerRate'].notna()], x='BorrowerRate', y='LenderYield', hue = 'yearSplit', palette=palette) # , style='LoanStatus'
grid.tick_params(axis='x', labelrotation=90)
grid.set_title('BorrowerAPR vs LenderYield by Term - post Jul 2009')
grid.set(xlabel='BorrowerRate', ylabel='LenderYield');
sns.move_legend(grid, "upper left", bbox_to_anchor=(1, 1));
plt.subplot(2, 2, 2)
#dfc_new['ListingCategory'].value_counts()
grid = sns.scatterplot(data=dfc_new[dfc_new['BorrowerRate'].notna()], x='BorrowerRate', y='EstimatedLoss', hue = 'yearSplit', palette=palette) # , style='LoanStatus'
grid.tick_params(axis='x', labelrotation=90)
grid.set_title('BorrowerAPR vs EstimatedLoss by Term - post Jul 2009')
grid.set(xlabel='BorrowerRate', ylabel='EstimatedLoss');
sns.move_legend(grid, "upper left", bbox_to_anchor=(1, 1));
plt.subplot(2, 2, 3)
#dfc_new['ListingCategory'].value_counts()
grid = sns.scatterplot(data=dfc_new[dfc_new['BorrowerRate'].notna()], x='BorrowerRate', y='EstimatedEffectiveYield', hue = 'yearSplit', palette=palette)
grid.tick_params(axis='x', labelrotation=90)
grid.set_title('BorrowerAPR vs EstimatedEffectiveYield by Term - post Jul 2009')
grid.set(xlabel='BorrowerRate', ylabel='EstimatedEffectiveYield');
sns.move_legend(grid, "upper left", bbox_to_anchor=(1, 1));
plt.subplot(2, 2, 4)
#dfc_new['ListingCategory'].value_counts()
grid = sns.scatterplot(data=dfc_new[dfc_new['BorrowerRate'].notna()], x='BorrowerRate', y='EstimatedReturn', hue = 'yearSplit', palette=palette)
grid.tick_params(axis='x', labelrotation=90)
grid.set_title('BorrowerAPR vs EstimatedReturn by Term - post Jul 2009')
grid.set(xlabel='BorrowerRate', ylabel='EstimatedReturn');
sns.move_legend(grid, "upper left", bbox_to_anchor=(1, 1));
# Faceting the BorrowerRate vs EstimatedLoss, EstimatedEffectiveyield, and EstimatedReturn relationships by year
# This shows the Estimated Loss different calcualtions changes from fixed bands with very high variance in 2009, and 2010 towards a more linear (small variance) relationship
# in 2011, 2012, 2013 and then finally a nearly perfect linear relationship in 2014.
import patchworklib as pw
pw.overwrite_axisgrid()
sns.set_theme()
sns.set_style("darkgrid")#, {"axes.spines.left": True,'axes.spines.bottom':True}
grid1 = sns.FacetGrid(data=dfc_new, col='year', hue = 'ProsperRating (Alpha)', col_wrap=6, palette="coolwarm", margin_titles=False, height=3, aspect= 1.33)
grid1.map(sns.scatterplot, 'BorrowerRate' , 'LenderYield', alpha=0.3);
grid1.fig.subplots_adjust(top=0.8)
grid1.fig.suptitle('LenderYield vs BorrowerRate by Year - post Jul 2009!', fontsize=24, x=8, y=15)
grid1.set_axis_labels("", "LenderYield")
grid1 = pw.load_seaborngrid(grid1, label="g1")
grid2 = sns.FacetGrid(data=dfc_new, col='year', hue = 'ProsperRating (Alpha)', col_wrap=6, palette="coolwarm", margin_titles=False, height=3, aspect= 1.33)
grid2.map(sns.scatterplot, 'BorrowerRate' , 'EstimatedLoss', alpha=0.3);
grid2.set_axis_labels("", "EstimatedLoss")
grid2.set_titles("")
#showsplines(grid2)
grid2 = pw.load_seaborngrid(grid2, label="g2")
grid3 = sns.FacetGrid(data=dfc_new, col='year', hue = 'ProsperRating (Alpha)', col_wrap=6, palette="coolwarm", margin_titles=False, height=3, aspect= 1.33)
grid3.map(sns.scatterplot, 'BorrowerRate' , 'EstimatedEffectiveYield', alpha=0.3)
grid3.set_axis_labels("", "EstimatedEffectiveYield")
grid3.set_titles("")
grid3.add_legend()
# showsplines(grid3)
grid3 = pw.load_seaborngrid(grid3, label="g3")
grid4 = sns.FacetGrid(data=dfc_new, col='year', hue = 'ProsperRating (Alpha)', col_wrap=6, palette="coolwarm", margin_titles=False, height=3, aspect= 1.33)
grid4.map(sns.scatterplot, 'BorrowerRate' , 'EstimatedReturn', alpha=0.3);
grid4.set_titles("")
# showsplines(grid4)
grid4 = pw.load_seaborngrid(grid4, label="g4")
sns.set_style("darkgrid")
(grid1/grid2/grid3/grid4)
# To see the above yearly changes, let's facetGrid the plot by year
# This plot clearly shows the shift from negative metrics in 2009 and 2010 to positive metrics in 2011-2014.
# It also shows the move from wide variance in 2011, 2012, and 2013 to a linear (small variance) in 2014.
# Also, there was a lot of overlap in earlier years (2009 and 2010) between lower ProsperRating (Alpha) and higher ProsperRating (Alpha) levels for EstimatedReturn.
# This overlap continued narrow down until year 2014 where the overlap has been reduced to approximately zero due to the sequential linar relationship the
# ProsperRating (Alpha) levels are made to move along.
plt.figure(figsize = [20, 10])
sns.set(font_scale=1) # change plots individual title size.
grid = sns.FacetGrid(data=dfc_new, col='year', hue = 'ProsperRating (Alpha)', col_wrap=6, palette="coolwarm", margin_titles=True, height=3, aspect= 1.33)# control facitgrid graph widths: margin_titles=True, height=3, aspect= 1.33
grid.map(sns.scatterplot, 'EstimatedEffectiveYield', 'EstimatedReturn' , alpha=0.3); # remove binwidth if you want auto generated different bin widths for each plot.
grid.fig.subplots_adjust(top=0.8)
grid.fig.suptitle('EstimatedReturn vs EstimatedEffectiveYield by Year - post Jul 2009!', fontsize=14);
grid.add_legend();
# Intorducing a bit of jitter on the previous plot, makes the high ProsperRating (Alpha) ponts clear in 2009 and 2010
plt.figure(figsize = [20, 10])
sns.set(font_scale=1) # change plots individual title size.
grid = sns.FacetGrid(data=dfc_new, col='year', hue = 'ProsperRating (Alpha)', col_wrap=6, palette="coolwarm", margin_titles=True, height=3, aspect= 1.33)# control facitgrid graph widths: margin_titles=True, height=3, aspect= 1.33
grid.map(sns.scatterplot, 'EstimatedEffectiveYield_jitter', 'EstimatedReturn_jitter' , alpha=0.3); # remove binwidth if you want auto generated different bin widths for each plot.
grid.fig.subplots_adjust(top=0.8)
grid.fig.suptitle('EstimatedReturn vs EstimatedEffectiveYield by Year - post Jul 2009!', fontsize=14);
grid.add_legend();
# Faceting the EstimatedLoss vs BorrowerRate relationship by year
# This shows the Estimated Loss different calcualtions changes from fixed bands with very high variance in 2009, and 2010 towards a more linear (small variance) relationship
# in 2011, 2012, 2013 and then finally a nearly perfect linear relationship in 2014.
plt.figure(figsize = [20, 10])
sns.set(font_scale=1) # change plots individual title size.
grid = sns.FacetGrid(data=dfc_new, col='year', hue = 'ProsperRating (Alpha)', col_wrap=6, palette="coolwarm", margin_titles=True, height=3, aspect= 1.33)# control facitgrid graph widths: margin_titles=True, height=3, aspect= 1.33
grid.map(sns.scatterplot, 'BorrowerRate' , 'EstimatedLoss', alpha=0.3); # remove binwidth if you want auto generated different bin widths for each plot.
grid.fig.subplots_adjust(top=0.8)
grid.fig.suptitle('EstimatedLoss vs BorrowerRate by Year - post Jul 2009!', fontsize=14);
grid.add_legend();
# Faceting the EstimatedLoss vs BorrowerRate relationship by year
# This shows the Estimated Loss different calcualtions changes from fixed bands with very high variance in 2009, and 2010 towards a more linear (small variance) relationship
# in 2011, 2012, 2013 and then finally a nearly perfect linear relationship in 2014.
plt.figure(figsize = [20, 10])
sns.set(font_scale=1) # change plots individual title size.
grid = sns.FacetGrid(data=dfc_new, col='year', hue = 'ProsperRating (Alpha)', col_wrap=6, palette="coolwarm", margin_titles=True, height=3, aspect= 1.33)# control facitgrid graph widths: margin_titles=True, height=3, aspect= 1.33
grid.map(sns.scatterplot, 'EstimatedLoss', 'EstimatedReturn' , alpha=0.3); # remove binwidth if you want auto generated different bin widths for each plot.
grid.fig.subplots_adjust(top=0.8)
grid.fig.suptitle('EstimatedLoss vs EstimatedReturn by Year - post Jul 2009!', fontsize=14);
grid.add_legend();
# let's facet EstimatedEffectiveYield vs EstimatedReturn by year and ProsperRating (Alpha) to see the trends of each ProspectRating (Alpha) change by year.
# This plot supports the observations made in previous plots. In ear
plt.figure(figsize = [20, 10])
sns.set(font_scale=1) # change plots individual title size.
grid = sns.FacetGrid(data=dfc_new, col='year', row="ProsperRating (Alpha)", hue = 'ProsperRating (Alpha)', palette="coolwarm", margin_titles=True, height=3, aspect= 1.33)# control facitgrid graph widths: margin_titles=True, height=3, aspect= 1.33
grid.map(sns.scatterplot, 'EstimatedEffectiveYield', 'EstimatedReturn' , alpha=0.5); # remove binwidth if you want auto generated different bin widths for each plot.
grid.fig.subplots_adjust(top=0.95)
grid.fig.suptitle('EstimatedReturn vs EstimatedEffectiveYield by Year and ProsperRating (Alpha) - post Jul 2009!', fontsize=14);
grid.add_legend();
# let's facet EstimatedEffectiveYield vs EstimatedReturn by year and LoanStatus and segment plot by ProsperRating (Alpha)
# This plot supports the observations made in previous explorations, where all levels from ProsperRating (Alpha) are present in all LoanStatuses
# Defaulted, Chargedoff, and Past Due Date have loans classifed with high ProsperRating (Alpha) all the way to low ProsperRating (Alpha), which is not what was expected.
# It was expected that Chagedoff and Defaulted would be dominated by very low PropserRating (Alpha) level e.g. HR and E.
from distinctipy import distinctipy
N=len(dfc_new['ProsperRating (Alpha)'].dropna().unique())
#colors = distinctipy.get_colors(N) #in seaborn palette=colors
colors = sns.color_palette("coolwarm", N)
plt.figure(figsize = [20, 10])
sns.set(font_scale=1) # change plots individual title size.
grid = sns.FacetGrid(data=dfc_new, col='year', row="LoanStatus", hue = 'ProsperRating (Alpha)', palette="coolwarm", margin_titles=True, height=3, aspect= 1.33)# control facitgrid graph widths: margin_titles=True, height=3, aspect= 1.33
grid.map(sns.scatterplot, 'EstimatedEffectiveYield_jitter', 'EstimatedReturn_jitter' , edgecolor='none' , alpha=0.5); #edgecolor='w'
grid.fig.subplots_adjust(top=0.95)
grid.fig.suptitle('EstimatedReturn vs EstimatedEffectiveYield by Year and Loan Status - post Jul 2009!', fontsize=14);
grid.add_legend();
# Let's change the shape from solid dots to circles to try to and improve visibility of covered /plotted over points.
# It is a bit better and clearer to see blue and orange dots for 2009 and 201o years. We can see that the above observations are confirmed again here.
from distinctipy import distinctipy
N=len(dfc_new['ProsperRating (Alpha)'].dropna().unique())
#colors = distinctipy.get_colors(N) #in seaborn palette=colors
colors = sns.color_palette("coolwarm", N)
plt.figure(figsize = [20, 10])
sns.set(font_scale=1) # change plots individual title size.
grid = sns.FacetGrid(data=dfc_new, col='year', row="LoanStatus", hue = 'ProsperRating (Alpha)', palette="coolwarm", hue_kws={"edgecolor": colors}, margin_titles=True, height=3, aspect= 1.33)# control facitgrid graph widths: margin_titles=True, height=3, aspect= 1.33
grid.map(sns.scatterplot, 'EstimatedEffectiveYield_jitter', 'EstimatedReturn_jitter' , facecolor="none", lw=2 , alpha=0.7); #edgecolor='w'
grid.fig.subplots_adjust(top=0.95)
grid.fig.suptitle('EstimatedReturn vs EstimatedEffectiveYield by Year and Loan Status - post Jul 2009!', fontsize=14);
grid.add_legend();
colors
[dfc['LoanKey'].nunique(), dfc['ListingKey'].nunique(),
dfc['LoanKey'].count(), dfc['ListingKey'].count(),
dfc['LoanKey'].size, dfc['ListingKey'].size]
# let's group the data and calculate the percent of LoanStatus for each year to that year total
grouped = dfc_new.groupby(['year','LoanStatus'])['ListingKey'].count().reset_index().rename(columns={'ListingKey':'count'})
grouped['subTotal']=grouped.groupby(['year'])['count'].transform('sum')
grouped['percent'] = grouped['count']/grouped['subTotal']
pd.reset_option('display.max_rows')
grouped.head(30)
# lets plot the above table for visual inspection
# The decline in Completed and increasse Current LoanStatuses makes sense as years increase.
# What is interesting is Defualted and Chargedoff are both showing a decline from 2012 onwards. However, we can't attribute this positive change to the new calculations
# intorudced. It can take a few years for the effect of those change to show up specially that majority of loans are 36 months or longer.
# One would have to execute statistical experiments to contorl for variables of interest and gain statistical confidence for effect of changes in algorithms.
plt.figure(figsize = [20, 10])
sns.set(font_scale=1) # change plots individual title size.
grid = sns.FacetGrid(data=grouped, col='LoanStatus', col_wrap=4, palette="coolwarm", margin_titles=True, height=3, aspect= 1.33)# control facitgrid graph widths: margin_titles=True, height=3, aspect= 1.33
grid.map(sns.lineplot, 'year', 'percent' , marker='o', alpha=0.9);
grid.fig.subplots_adjust(top=0.9)
grid.fig.suptitle('year vs ProsperRating (Alpha) percent - post Jul 2009!', fontsize=14);
grid.add_legend();
#sns.lineplot(data=grouped.query("LoanStatus='Chargedoff'"), x="year", y="percent")
# let's segement the data further by ProsperRating (Alpha) to explore changes at ProsperRating (Alpha) lower ratings
grouped = dfc_new.groupby(['year','LoanStatus','ProsperRating (Alpha)'])['ListingKey'].count().reset_index().rename(columns={'ListingKey':'count'})
grouped['subTotal']=grouped.groupby(['year','LoanStatus'])['count'].transform('sum')
grouped['percent'] = grouped['count']/grouped['subTotal']
pd.reset_option('display.max_rows')
grouped.head(30)
# let's plot the data table above for visual inspection
# Clearly the Defaulted, Chargedoff, and Past Due Date overall show that lower ProsperRating (Alpha) levels have higher chances of defualting, chargeoff, and missing payments.
# This is observed by the red/orange lines having coming above the blue lines in the Defaulted, Chargedoff, and Past Due Date plots.
# Current loans, shows 2010 having higher levels of ProsperRating (Alpha) present, but 2011 and 2012 shoing higher percent of lower ProsperRating (Alpha) levels present,
# then 2013 and 2014 moving towards higher presence of higher ProsperRating (Alpha) levels in the data.
# There seems to be a slight trend where percent of higher ProsperRating (Alpha) levels dropping down in the Defaulted and Chargedoff statuses plot for latter years. Which
# indicates some improvements have been achieved in correctly predicting loans statuses and future losses.
# The trends are not very strong, and while we can see some improvements were achieved in the latter years in correctly predicting loan statuses and future losses,
# there still remains a level of variance in the plots, which makes it hard to be certain, and it remains to be verified by getting extracts for 3-4
# more years after 2014 to observe the trend better.
# The observations also needs to be verified by running statistical AB testing programme against the changes vs a control group to see if new changes have made
# positive impact at what level of certainty.
plt.figure(figsize = [20, 10])
sns.set(font_scale=1) # change plots individual title size.
grid = sns.FacetGrid(data=grouped, col='LoanStatus', col_wrap=4, hue = 'ProsperRating (Alpha)', palette="coolwarm", margin_titles=True, height=3, aspect= 1.33)# control facitgrid graph widths: margin_titles=True, height=3, aspect= 1.33
grid.map(sns.lineplot, 'year', 'percent' , marker='o', alpha=0.9);
grid.fig.subplots_adjust(top=0.9)
grid.fig.suptitle('year vs ProsperRating (Alpha) percent - post Jul 2009!', fontsize=14);
grid.add_legend();
# What is the average EstimatedReturn per year? and does it show any change over time?
# lets make sum calcs first:
grouped = dfc_new.groupby(['year']).agg({'EstimatedReturn':[np.mean, np.median]}).reset_index()
grouped.columns = ['_'.join(tup).rstrip('_') for tup in grouped.columns.values]
grouped
# What is the average EstimatedReturn per year? and does it show any change over time?
# lets plot the above table for visual inspection
# The average EstimatedReturn increased to a peak value of 0.115 in year 2011, but it declined since then to reach 0.072 in 2014.
plt.figure(figsize = [10, 5])
sns.set(font_scale=1) # change plots individual title size.
gmean = sns.lineplot(data=grouped, x="year", y="EstimatedReturn_mean", label = 'Mean')
gmedian = sns.lineplot(data=grouped, x="year", y="EstimatedReturn_median", label = 'Median')
plt.xlabel("Year")
plt.ylabel("EstimatedReturn Mean / Median Scale")
plt.title("EstimatedReturn Mean and Median per Year - post Jul 2009")
plt.legend() # neeeds , label = 'Mean' added to plot code to work!
plt.show();
Again when plotting Histogram Distribution of LenderYield by Listingcategories and CreditGrade, the very same trend is noticable and also shows much greater correlation between LenderYield increasing as CreditGrade decreased in the scale than CreditScoreRangeLower.
When plotting BorrowerAPR vs EstimatedLoss by Term it revealed two groups of data. One shows a horizontal bands of data that seems to be predomenatly 36 months Term Loans, and one that is a diagonal swarm of data that seem to be a mix of 12, 36, and 60 months Term loans. This relationship manifested itself in the BorrowerAPR vs EstimatedEffectiveYield, and EstimatedReturn plots too albiet in a slightly different arrangmenet but still clearly showing two groups of data relationships one solely related to 36 months loans, and one related a mix of 12, 36, and 60 months loans.
When plotting BorrowerAPR vs EstimatedLoss, EstimatedEffectiveYield, and EstimatedReturn by ProsperRating (Alpha), it showed a very clear set of horizontal bands each related to a ProsperRating (Alpha) grade, which showed that Estimated Return is higher for higher ProsperRating (Alpha) values. Also the plots showed that higher ProsperRating (Alpha) has lower EstimatedLoss, higher EstimatedEffectiveYield, and higher EstimatedReturn. Also, As the ProsperRating (Alpha) decreased to lower ratings, the BorrowerRate and EstimatedLoss moved to higher rates, whilest EstimatedEeffectiveYield and EstimatedReturn moved to lower lower rates.
When segmenting the data by Current loans and None Current loans and repeating the above plot, it showed that current loans are characterised solely by the diagonal swarm of data, whereas the None Current loans seemed to be characterised by the horizontal bands of data. Which suggests that different algorithms were depolyed for measuring current loans vs historical loans.
Segmenting the above plot by Term, revealed that 60 months Terms had Lower EstimatedLoss and highest Estimated Return, whereas 12 months Terms showed highest EstimatedLoss and Lowest Estimated Return. The 36 months Terms had EstimatedLoss and EstimatedReturn on average sat between the 12 months and 60 months Term loans.
ProsperScore also showed similar influnce on the BorrowerAPR vs EstimatedLoss, EstimatedEffectiveYield, and EstimatedReturn set of plots, which was more noticable when using a divergent palette that moved from darker blue hues on the left handside towards darker red hue on the right hand side.
Wheng categorising the BorrowerAPR vs EstimatedLoss, EstimatedEffectiveYield, and EstimatedReturn plots by CreditScoreRangeLower a similar picture to ProsperRating (Alpha) and ProsperScore emerged, where lower higher values showed lower EstimatedLoss, higher EstimatedEffectiveYield, and Higher EstimatedReturn for the same BorrowerRate. As Lower CreditScoreRangeLower are dominated by higher BorrowerRate and higher Estimated Loss, Lower EstimatedEffectiveYield, and Lower EstimatedReturn.
When segmenting the BorrowerAPR vs EstimatedLoss, EstimatedEffectiveYield, and EstimatedReturn plots by LoanStatus Chargedoff and Default, All levels of ProsperRating (Alpha) were present and not just the lower grade rated borrower's loans.
When segmenting the BorrowerAPR vs EstimatedLoss, EstimatedEffectiveYield, and EstimatedReturn plots by Year of loan, it showed that 2009 and 2010 loans had different spread than the 2011 to 2014 years.
Plotting EstimatedReturn vs EstimatedEeffectiveYield and segmenting it by different variables, showed that same previous observation where years 2009 and 2010 loans had different spread than the years 2011 to 2014. This trend was reflected on the Terms, where 12 and 60 months Term Loans which was introduced mainly from 2011 onwards showed different spread than the 36 months Term Loans. Inspecting the categorised ProsperRating (Alpha) plot shows higher rated borrowers' loans occupying the middle of the charts, and the lower rated borrowers' loans some making negative Returns and some making positive high Returns for yearss 2009 and 2010, but always making highest Returns for years 2011-2014.
Plotting the EstimatedReturn vs EstimatedEffectiveYield plot and and grouping by colour of Year, showed the change of spread overtime, where 2009 and 2010 moved in a linear fashion from negative to positive Returns level. 2011 onwards showed the trends move from low positive Returns levels dominated by higher rated loans to high Returns levels domainted by lower rated loans. Whereas 2011-2013 years trends moved in a swarm of moderate variance, the 2014 year moved in a very linear (diagonal line) fashion.
The EstimatedLoss vs BorrowerRate plot and EstimatedLoss vs EstimatedReturn plot confirmed the same relationship between the key features described in the previous point.
When looking at the rate of LoanStatuses per year it was difficult to determine if the yearly differences made any difference on reducing chargedoff or deafulted loans. This is mainly due to the length of loan Terms 12, 36, and 60 months, which require a larger window into the future to be able to analyse such impact. However, we can see that the average EstimatedReturn increased to a peak value of 0.115 in year 2011, it declined since then to reach 0.082 in 2014.
When plotting the Histogram Distribution of LenderYield faceted by ProsperRating (Alpha) and coloured by Term, revealed the large peak of LenderYield at 0.3 is related to 36 months Term Loans that are labelled High Risk (HR) and level E.
The interaction between BorrowerAPR vs EstimatedLoss, EstimatedEffectiveYield, and EstimatedReturn in the plots when segmeneted by ProsperRating (Alpha) revealed very interesting bandings that correlated perfectly with the ProsperRating (Alpha) levels.
Plotting the EstimatedReturn vs EstimatedEffectiveYield plot and and grouping by colour of Year revealed major differences in features relationships between different years. E.g. 2009 and 2010 years have negative Returns loans, whereas 2011-2014 had positive Return loans only. Also 2009 and 2010 years has perfctly linear Returns relationship with lots of overlap between ProsperRating (Alpha) bands, whereas 2011-2013 has a swarm of direct positive Return relationship with some overlap between ProsperRating (Alpha) bands, but 2014 moving to a perfectly linear Returns relationship with no overlap between the diffferent ProsperRating (Alpha) bands.
The average EstimatedReturn increased to a peak value of 0.115 in year 2011, it declined since then to reach lowest value every of 0.073 in 2014. It is unclear how the EstiamtedReturn is correlated to the actual Return for loans, and if those changes are algorithmic changes to better estimate actual Returns. Further Analysis of payments made on loans is required to answer some of those questions.
36 months Term Loans are the domenant type of Loans in the dataset. 12 months and 60 months Term Loans seem to be introduced in Nov 2010. 12 months Loans seem to be stopped in May 2013.
Ignoring Other and No Available categories, Debit Consolidation was the overall dominant ListingCategory (62.8% post Jul 2009 and 17% pre Jul 2009). Home Improvement and Business came next (8% and 6% post Jul 2009).
Lender Yield increased in post Jul 2009 data vs pre Jul 2009 data for categories which are present in both datasets, except for Debt Consolidation, which decrease slightly in post Jul 2009 vs pre Jul 2009.
LenderYield increases as CreditScoreRangeLower and CreditGrade worsened in pre Jul 2009 dataset, and as ProsperRating (Alpha), ProsperScore, and CreditScoreRangeLower worsened in post Jul 2009 data for all ListingCategories.
ProsperRating (Alpha) High Risk (HR) and E rated loans are associated with a large peak of number of loans at 0.3 lenderYield.
EstimatedLoss, and EstimatedEffectiveYield are characterised by very distinct bands of data which are highly correlated to the ProsperRating (Alpha) levels.
On average EstimatedReturn is highest for 60 months Term loans, then 36 months Term loans taking middle levels, finally 12 months Term loans has the lowest Estimatedreturn when considering post Nov 2010, where the three loan Terms are present.
Different EstimatedReturns vs EstimatedEffectiveYield relationships were noticed for different years:
The changes from year to year did not seem to show any immediate stark change in Chargedoff and Defaulted loans statuses percent of total loans. An extract of dates with longer date range (3-5 extra years) into the future (i.e. to year 2019) is need to observe the performance of those loans and be able to deduct if the observed yearly differences impacted Chargedof and Defaulted. Interestingly, the average EstimatedReturn increased to a peak value of 0.115 in year 2011, but declined since then to reach lowest value of 0.082 in year 2014 and we don't really know how this correlates with actual Returns back from loans.
LenderYield, EstimatedEffectiveYield, and EstimatedReturn were picked as our features of interest and we wanted to understand how those vary according ListingCateogry, CreditGrade, CreditScoreRangeLower, and ProsperRating (Alpha), Having a houser, and IncomeRange.
The analysis started by carrying univariate analysis of those variables to understand their distribution, their statistical summaries, and the characterstics of loans in the dataset. For example it was interesting to see Debt Consolidation being the domenant ListingCategory by large and that all Categories seems to have similar distributions of the LenderYield.
Then I moved on to carrying out Bivariate analysis of the two variables against each other to try and uncover dual relationship between a number of variables segmented by others features in the data set. Some interesting observations were uncovered such as LenderYield gone up in post Jul 2009 data for all ListingCategories present in the two periods vs pre Jul 2009 data except for Debt Consolidation, which went down slightly. Also, we observed ProsperRating (Alpha) HR level was awarded 36 months Term only. Also we were able to see that 12 and 60 months Terms loans were introduced in late Nov 2010, and 12 months loans were stopped in Apr 2013. carrying out a pair plot helped us spot some interesting relationships between features/variables in the dataset such as BorrowerAPR vs EstimatedLoss, EstimatedEffectiveYield, and EstimatedReturn, which formed the basis for the multivariate analysis.
Then I moved on to carry out Multivariate Analysis on the dataset, further segmenting the plots by a third and sometime a fourth variable to try and glean insights into how the relationships are formed. For example facet grids, and adding hue or shape dimensions to the plot allowed for interesting inisghts to be revealed. For example, were able to spot that the a large peak of EstimatedReturn at 0.3 was solely related to 36 months Term loans that are labelled High Risk (HR). Alos, it was possilbe to see that the different bands observed in the BorrowerAPR vs EstimatedLoss, BorrowerAPR vs EstimatedEffectiveYield, and BorrowerAPR vs EstimatedReturn was mainly due to ProsperRating (Alpha) levels. Using divergent palettes, it was also possible to see that CreditScoreRangeLower and CreditGrade correlated with and had similar impact to ProsperRating (Alpha) on those relationships. Also, it was observed that over the years the calculation for Loan Returns seems to have gone changes, where in years 2009 and 2010 have a straight direct relationship with BorrowerRate starting that starts from negative EstimatedReturn for HR loans, Years 2011, 2012, ans 2013 moved to a swarm of diagonal data that have direct relationship with more variance per BorrowerRate values and that has positive EstimatedReturns only, and year 2014 moved to a direct linear relationship between BorrowerRate and EstimatedReturn that has only positive EstimatedReturns.
At the end I measured the rate of each LoanStatus for each year to try and see if those yearly changes correlated with improvements such as reducing percentage of chargedoff or defaulted loans. When looking at the Blue (high ProsperRating (Alpha)) and Red lines (low ProsperRating (Alpha)) in the plots, it was possible to see some level of improvement where blue lines declined in the ChargedOff and Defaulted plots with slight worsening observed in the slight increase of Blue lines for 2014 in the ChargedOff, Defaulted, and past due date plots. In the absence of strong signals it is defualt to take a verdict and to Truely measure improvement or worsening in performance, one need statistical analysis done on the data for example by using AB Testing.
Remove all Tips mentioned above, before you convert this notebook to PDF/HTML
At the end of your report, make sure that you export the notebook as an html file from the
File > Download as... > HTML or PDFmenu. Make sure you keep track of where the exported file goes, so you can put it in the same folder as this notebook for project submission. Also, make sure you remove all of the quote-formatted guide notes like this one before you finish your report!